1 Introduction

Smartphone security is a critical challenge for both academia and industry. According to Check Point’s 2021 security report, 97% of organizations faced mobile threats in 2020 [1], and 46% of organizations reported incidents in which employees unknowingly downloaded malicious mobile applications. The same report noted that 40% of smartphones worldwide remain susceptible to cyber-attacks. In addition, G DATA CyberDefense reported over 1.3 million new malicious smartphone apps in the first half of 2021 [2]. Android-based smartphones dominate the global market; manufacturers favor Android for its open-source nature, its abundance of development libraries, and its cost-effectiveness. Statistics reported in 2024 indicate that Android has more than 3 billion active users and holds a 75% share of the global market [3]. This widespread adoption makes Android-based smartphones prime targets for attackers and poses escalating security challenges. Despite extensive research in malware analysis, malicious payloads and applications continue to proliferate, even within Google’s official app repository. According to Securelist, malicious behavior persists within applications on Google Play despite Google’s mitigation efforts [4]. Prevalent threats include the Joker Trojan, implicated in unauthorized paid subscriptions; the Facestealer Trojan, which targets Facebook credentials; and a range of banking Trojan loaders. Moreover, Kaspersky reported numerous Trojan loaders, including those for the Joker and Facestealer malware, detected within Google Play apps in 2021 [5]. Another security report recorded nearly 3.5 million malicious installation packages on mobile devices in the same year [6]. Furthermore, mobile security reports indicated a nearly twofold increase in detected Trojans between 2020 and 2021 [7].
Additionally, attackers employ various code obfuscation techniques to camouflage malicious code, which complicates detection. For instance, the malicious code injection-based obfuscation method proposed in [8] generates mutated versions of malware applications that challenge multiple commercial antivirus systems. In testing, the detection rates of commonly used antivirus systems dropped to 10% or less on the constructed mutated malware applications.

Deep learning, a subset of machine learning, has emerged as a powerful tool in various fields due to its capability to learn complex patterns from large volumes of data. Its applications span across diverse domains such as computer vision [9, 10], natural language processing [11, 12], healthcare [13], finance, cybersecurity, and more. In computer vision, deep learning models excel at tasks like image classification, object detection [14, 15], and image segmentation, revolutionizing industries from autonomous vehicles to medical imaging. Natural language processing tasks, such as sentiment analysis, language translation, image captioning [12], and chatbots, benefit greatly from deep learning’s ability to understand and generate human language. Moreover, in healthcare, deep learning is being utilized for disease diagnosis, personalized treatment plans, and drug discovery, leading to more accurate diagnoses and improved patient outcomes.

Despite these efforts, a noticeable gap remains between the volume of published work and the daily influx of new attacks, underscoring the urgency of innovative approaches to malware detection and mitigation. Motivated by this gap, the present study aims to contribute to smartphone security by developing advanced malware detection techniques that detect and mitigate emerging threats effectively.

2 Related works

Numerous works have emerged in the field of Android malware analysis and detection since 2010. These works can generally be classified into four trends: static analysis-based, dynamic analysis-based, hybrid analysis, and code visualization-based models [16]. In this section, we will discuss some works that fall under each of these trends.

In [17], DL-Droid, a dynamic analysis-based deep learning model, was proposed for Android malware analysis and detection. The method is based on stateful input generation and was evaluated on more than 30,000 applications running on real devices. In [18], multiple static analysis-based features, such as method APIs, shared library function opcodes, and permissions, were extracted from a dataset containing 41,260 Android applications and used to train a multimodal deep learning model for Android malware detection. In [19], MalDozer, a deep learning-based model for automatic Android malware detection, was proposed. MalDozer extracts API method calls, replaces each API method with an identifier, and generates semantic vectors that are used to train a deep learning model. In [20], DroidCat, a dynamic analysis-based method, was proposed for detecting malware in the Android environment. The method extracts dynamic features based on method calls and inter-component communication (ICC) Intents from Android apps and uses them as a behavioural app profile for training a multi-class machine learning classifier. In [21], four tree-based machine learning classifiers were evaluated for Android malware classification alongside a substring-based feature selection method. The Random Forest classifier was reported to outperform previously obtained results, with classification accuracy reaching 97.24%. In [22], a machine learning-based Android malware family classification framework called AndMFC was proposed. The framework extracts two static analysis features, namely requested permissions and API calls, and uses them to train multiple machine learning classifiers.
In [23], an entropy-based behavioural analysis method called EntropLyzer was proposed for classifying and characterizing Android malware categories based on multiple dynamic features, including memory, API, network, battery, and process features. In [24], a deep learning-based Android malware detection model called DeepAMD was proposed. It was compared with conventional machine learning algorithms and reported to outperform previously proposed static and dynamic analysis-based works. In [25], a Graph Convolutional Network (GCN)-based Android malware detection approach called GDroid was proposed. The approach converts an app’s APIs into a large heterogeneous graph and feeds it into a GCN model that performs the classification. In [26], VisDroid, a multi-class classification approach, was proposed for classifying Android malware into families based on image-based global and local features. The extracted features were used to train a new voting classifier built from machine learning algorithms, and the results were reported to outperform those of state-of-the-art models. Also, in [27], the source code of Android apps was converted to grayscale images and used to train multiple machine learning algorithms for distinguishing benign from malicious Android apps. In [28], DeepVisDroid, an image-based deep learning model, was introduced. Android apps were converted to images, and image-based global and local features were extracted and used to train a one-dimensional CNN model that classifies Android apps as benign or malicious. In [29], a new feature extraction method called DroidEncoder, based on the auto-encoder structure, was proposed for Android malware detection.
The method uses an image-based Android app dataset with 3000 malicious and 3000 benign apps. Three auto-encoders are proposed, and multiple machine learning algorithms are trained in the experiments. Cross-validation and evaluation with multiple metrics show superior performance in all metrics. Yilmaz et al. [30] developed a machine learning-based Android malware detection system. The data was balanced using the SMOTE, SMOTETomek, and ClusterCentroids methods and optimized using various feature selection approaches. The most successful methods were tuned using the GridSearch, Random Search, and Bayesian Optimization algorithms to investigate the effects of hyperparameter tuning on ML algorithm performance. Furthermore, the authors of [31] explore the effectiveness of a genetic algorithm-based hyperparameter tuning mechanism and a hybrid feature selection approach in enhancing intrusion detection systems (IDSs). The study proposes a machine learning-based IDS approach for detecting attacks in IoT environments, using a hybrid feature selection method and genetic algorithm fine-tuning. The results show that hyperparameter optimization can enhance the accuracy and efficiency of machine learning-based IDSs for IoT networks. Table 1 provides a concise overview of various works utilizing different methodologies in this domain.

Table 1 A concise overview of various works conducted in the malware detection domain

This work introduces VoteDroid, a novel ensemble voting classifier built upon fine-tuned deep learning models. The model aims to optimize Android malware detection by fine-tuning multiple deep learning models using an optimization algorithm. Specifically, the random search algorithm is used to select the best structures for three deep learning-based models: CNN-ANN, pure CNN, and pure ANN. Our approach suggests potential components for each model and lets the random search algorithm decide the number and location of these components within the final model. This involves optimizing multiple hyperparameters, including the number of convolutional layers, filters per convolutional layer, presence and placement of MaxPooling and BatchNormalization layers, number of dense layers, neurons per dense layer, activation functions, weight initializers, presence and placement of dropout layers, and learning rate. We convert the DEX code of Android applications into grayscale images for tuning, training, and testing the models used in the final VoteDroid model. After individually training and testing the fine-tuned deep learning models, we combine them into an ensemble voting classifier operating in two modes: Malicious Minority Rule (MMR) and Label Majority Rule (LMR). To our knowledge, this is the first instance of fine-tuning and hybridizing an ensemble voting classifier in this manner for malware classification tasks. Our proposed models achieve classification accuracy exceeding 97% in both standalone and ensemble testing experiments.

3 Material and method

3.1 Android app architecture

Typically, Android applications are distributed as APK files, akin to ZIP archives, containing the application code in DEX format, along with native libraries, resources, assets, and more. Upon extraction of an APK file, the following folders and files are commonly obtained:

  • AndroidManifest.xml: Houses metadata such as permissions, version number, package name, and app components’ details.

  • classes.dex: Contains the application’s code in .dex format.

  • Assets: Stores the application’s assets.

  • resources.arsc: Holds compiled resources like colors, styles, and strings.

  • META-INF: Contains information regarding the application’s certificate and signature.

  • Lib: Contains the compiled format of native libraries used by the application.

  • Res: Houses resources not compiled into resources.arsc.

In this study, we use the DEX code files extracted from Android applications. These files are transformed into grayscale images for training, tuning, and evaluating the proposed VoteDroid model.
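The DEX-to-image step can be illustrated with a minimal sketch. It assumes that each DEX byte becomes one pixel intensity (0 to 255) and that the byte stream is wrapped into rows of a fixed width, with the last row zero-padded. The row width, the padding, and the function names are assumptions made for illustration; the text does not state the image geometry that was actually used.

```python
import io
import zipfile


def dex_bytes_from_apk(apk_file):
    """Return the raw bytes of classes.dex from an APK (a ZIP archive)."""
    with zipfile.ZipFile(apk_file) as apk:
        return apk.read("classes.dex")


def bytes_to_grayscale_rows(data, width=256):
    """Map each byte to one pixel (0-255) and wrap the pixels into rows.

    The default width and the zero padding of the last row are
    assumptions of this sketch, not values taken from the paper.
    """
    pixels = list(data)
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Usage with a tiny in-memory stand-in for an APK archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_apk:
    fake_apk.writestr("classes.dex", bytes(range(10)))
buffer.seek(0)

image = bytes_to_grayscale_rows(dex_bytes_from_apk(buffer), width=4)
# image == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 0]]
```

In practice the rows would be written out with an imaging library and resized to the input shape the models expect; that step is omitted here.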

3.2 Deep learning techniques

3.2.1 Artificial neural network (ANN)

ANN, or Artificial Neural Network, emulates biological neural networks found in the brain through a collection of mathematical computation units called neurons. Each neuron receives input from previous neurons, performs mathematical computations, and passes its output to subsequent neurons. Neurons are typically organized into stacked layers known as dense layers, comprising input, hidden, and output layers. The input vector enters the input layer, traverses multiple hidden layers where various mathematical operations are applied, including both linear and non-linear transformations. Each connection between neurons is assigned a coefficient, commonly referred to as a weight. During training, the goal is to adjust these weights so that the model’s output closely resembles the expected output. Training involves feeding input data samples through the network, predicting outputs, computing the difference between predictions and ground-truth outputs using a loss function, and optimizing the model’s coefficients using an optimization algorithm.
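As a rough illustration of the forward pass and loss computation described above, the following sketch evaluates a two-layer network in plain Python. The weights, the ReLU activation, and the mean squared error loss are arbitrary choices for the example, not the configuration used in this study.

```python
def relu(value):
    """Non-linear activation: pass positive values, clamp negatives to zero."""
    return max(0.0, value)


def dense_forward(inputs, weights, biases, activation=relu):
    """One dense layer: activation(W x + b), with one neuron per row of W."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def mse_loss(predicted, expected):
    """Mean squared difference between predictions and ground truth."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)


x = [1.0, 2.0]
hidden = dense_forward(x, weights=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, -1.0])
output = dense_forward(hidden, weights=[[1.0, 0.5]], biases=[0.0])
loss = mse_loss(output, [0.0])
# hidden == [0.0, 2.0], output == [1.0], loss == 1.0
```

Training would repeat this computation over many samples and let an optimizer adjust the weights and biases to reduce the loss.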

3.2.2 Convolutional neural network (CNN)

CNNs, or Convolutional Neural Networks, are extensively used in computer vision and image processing to enhance accuracy and reduce computational complexity. Unlike traditional ANNs that convert input images into vectors, CNNs employ multiple convolutional and down-sampling operations to highlight important information and remove noise, resulting in a dimensionally reduced and information-rich feature map. Comprising convolutional, pooling, and fully connected (or dense) layers, CNN architectures are structured to efficiently extract key features from input images through convolutional layers, reduce feature map dimensions via pooling layers to minimize computational overhead, and make final decisions about input data samples using dense layers.
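The two core operations, convolution and pooling, can be sketched in plain Python as follows. The toy image and the hand-written edge kernel are illustrative only; in a trained CNN the kernel values are learned.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding.

    This is cross-correlation, which is what deep learning libraries
    usually implement under the name "convolution".
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_h) for j in range(k_w))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


image = [[0, 0, 9, 9, 0] for _ in range(5)]
edge_kernel = [[-1, 1]]  # responds to a dark-to-bright step, left to right
feature_map = conv2d_valid(image, edge_kernel)  # every row is [0, 9, 0, -9]
pooled = max_pool_2x2(feature_map)              # [[9, 0], [9, 0]]
```

The 5x5 input shrinks to a 2x2 map that still records where the edge is, which is the dimensionality reduction the paragraph above refers to.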

In this work, we chose the ANN and CNN deep learning architectures to be hybridized and used in Android malicious code detection. The selection of Artificial Neural Network (ANN) and Convolutional Neural Network (CNN) architectures for the detection of malicious code in Android applications is underpinned by their innate capabilities in handling complex, unstructured data. Notably, the versatility of ANN and CNN architectures allows for their application across various domains [32,33,34,35,36], contributing to their widespread adoption and effectiveness in achieving tasks beyond traditional machine learning paradigms [37, 38]. In the realm of Android security, where the identification of subtle patterns and anomalies amidst vast amounts of code is paramount, ANN and CNN architectures offer distinct advantages. ANNs, inspired by biological neural networks, excel in learning intricate relationships within data, making them well-suited for the nuanced task of discerning malicious behavior in Android apps. Furthermore, CNNs, with their hierarchical feature extraction and dimensionality reduction abilities, are adept at processing image data, a common representation of app code, enabling efficient analysis and detection of malicious patterns. By leveraging the inherent strengths of ANN and CNN architectures, we aim to enhance the efficacy of Android malicious code detection, ultimately bolstering the security of mobile ecosystems.

3.3 Constructed dataset

In this study, a dataset comprising 6000 Android APK archives was constructed for hyperparameter fine-tuning, model training, and testing. The dataset consists of 3000 benign samples and 3000 malicious samples. The malicious samples were randomly selected from well-known Android malware datasets, including Drebin [39] and Malgenome [40]. Drebin contains 5560 malicious applications spanning 179 different malware families, while Malgenome encompasses over 1200 malicious Android applications covering a wide range of existing malware families. The benign applications were sourced from the official Google Play repository using the APKPure online downloader (footnote 1), and a Python script was developed to automate this process. The flowchart in Fig. 1 outlines the algorithm used by the script to construct the benign dataset. First, the Google Play repository was mined to retrieve the package names of the applications. Next, the APK archives corresponding to the collected package names were downloaded using APKPure. Python code was then used to scan the downloaded APKs with the VirusTotal online API (footnote 2). An Android application was classified as benign only if none of the more than 70 commercial antivirus engines integrated into the VirusTotal online service flagged it as malicious; otherwise, the APK archive was discarded.
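The acceptance rule for benign samples can be expressed as a short filter. The sketch below assumes the scan results have already been collected into a dictionary that maps each package name to per-engine verdicts; this data layout, the engine names, and the package names are hypothetical, and no real VirusTotal API call is shown.

```python
def is_benign(engine_verdicts):
    """Return True only if no engine flagged the sample as malicious.

    `engine_verdicts` maps an engine name to True (flagged) or False (clean).
    """
    return not any(engine_verdicts.values())


def select_benign(scan_reports):
    """Keep the packages that every engine considered clean."""
    return [
        package
        for package, verdicts in scan_reports.items()
        if is_benign(verdicts)
    ]


# Hypothetical scan reports for two downloaded APKs.
scan_reports = {
    "com.example.notes": {"EngineA": False, "EngineB": False},
    "com.example.flashlight": {"EngineA": False, "EngineB": True},
}
benign_packages = select_benign(scan_reports)
# benign_packages == ["com.example.notes"]
```

A single positive verdict is enough to discard an APK, which keeps the benign class conservative at the cost of rejecting some false alarms.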

Fig. 1 The benign dataset construction algorithm

3.4 Hyperparameters fine-tuning

Deep learning models are influenced by two types of parameters: trainable parameters and hyperparameters. Trainable parameters, such as weights and biases, are optimized during the training process to enhance model performance. Hyperparameters, on the other hand, directly affect model performance but cannot be learned during training. Examples include the number of layers, the number of neurons per layer, and the choice of activation function. Hyperparameters must therefore be selected by the model designer based on experience, heuristics, or optimization algorithms. This process of adjusting hyperparameters is known as hyperparameter fine-tuning. Various algorithms can be employed for it; the most prominent ones are briefly described below:

  • Grid search: In this method, a grid encompasses all possible values within the search domain. Each collection of values within the grid is systematically tested, and the collection that yields the best results is selected for use in the desired model. While this approach exhaustively examines all potential value combinations, its computational overhead is considerable, rendering it computationally inefficient.

  • Random search: Random search, unlike Grid search, selects combinations of values randomly from the search domain to test the model, rendering it computationally efficient. While it may not always find the optimal hyperparameter set, random search often produces models with performance close to the ideal. This approach is typically favored when dealing with a high number of parameters to circumvent the computational overhead associated with Grid search.

  • Bayesian algorithm: The Bayesian algorithm starts with a random search to identify a promising combination of values and then concentrates on that area of the search space rather than the entire domain. While computationally efficient, this approach may overlook other areas that could yield optimal results.
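The difference in cost between grid search and random search can be made concrete with a small sketch. The search space below is purely illustrative (it is not the range of values used in this study): grid search enumerates every combination, whereas random search draws a fixed number of candidates.

```python
import itertools
import random

# Illustrative search space; the options are assumptions for this sketch.
search_space = {
    "conv_layers": [1, 2, 3],
    "filters": [32, 64, 128, 224],
    "activation": ["relu", "tanh"],
    "learning_rate": [0.01, 0.001, 0.0001],
}


def grid_candidates(space):
    """Yield every combination of values in the search space."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def random_candidates(space, n_trials, seed=0):
    """Yield n_trials combinations drawn uniformly at random."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {key: rng.choice(options) for key, options in space.items()}


grid_size = sum(1 for _ in grid_candidates(search_space))  # 3 * 4 * 2 * 3 = 72
sampled = list(random_candidates(search_space, n_trials=20))
```

Each additional hyperparameter multiplies the grid size, while the random search budget stays at whatever number of trials is chosen.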

3.5 Proposed models

3.5.1 Tuning VoteDroid components

In this study, the random search algorithm was utilized to fine-tune hyperparameters for three neural network models employed in Android malware detection. Various hyperparameters were adjusted, encompassing factors such as the number of CNN layers, filters within each CNN layer, presence of Pooling and BatchNormalization layers, number of Dense layers, neurons in each Dense layer, activation functions, dropout layer presence, and learning rate values. This approach aimed to optimize the performance of the neural network models proposed in this research. The random search algorithm was employed to fine-tune three distinct deep learning models. We outlined the potential components for each model and delegated the determination of optimal unit quantities for each component to the random search algorithm. The algorithm was configured to execute 20 trials, with 3 executions per trial spanning 5 epochs each, as detailed in Table 2. Fine-tuning multiple hyperparameters facilitated the selection of the top three models for optimal malware detection. Notably, the tuned hyperparameters varied based on the model’s component composition. The ranges of values used for hyperparameter tuning across the proposed models are summarized in Table 3.
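A minimal sketch of a random search loop with the budget described above (20 trials, 3 executions per trial) is given below. The scoring function is a stand-in for training a model for 5 epochs and returning its validation accuracy. The averaging of repeated executions, the toy search space, and all names are assumptions made for illustration and are not the authors' implementation.

```python
import random


def run_random_search(space, evaluate, max_trials=20, executions_per_trial=3, seed=0):
    """Sample configurations at random and keep the one with the best mean score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_trials):
        config = {name: rng.choice(options) for name, options in space.items()}
        # Repeat each configuration to average out random initialization effects.
        scores = [evaluate(config, run) for run in range(executions_per_trial)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score


def toy_evaluate(config, run):
    """Stand-in for 'train for 5 epochs and return validation accuracy'."""
    base = 0.9 if config["activation"] == "relu" else 0.8
    return base + 0.01 * config["dense_layers"] + 0.001 * run


space = {"activation": ["relu", "tanh"], "dense_layers": [1, 2, 3]}
best_config, best_score = run_random_search(space, toy_evaluate)
```

With these settings, the search for one model trains for at most 20 × 3 × 5 = 300 epochs in total.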

Table 2 The configuration settings used for adopting the random search algorithm
Table 3 The range of values used for each tuned hyperparameter

Three deep learning models were fine-tuned using the random search algorithm with the aforementioned configuration. These models include the CNN-ANN, pure CNN, and pure ANN models, each with their hyperparameters adjusted accordingly. Figure 2 depicts the training and validation accuracy changes over 5 epochs for the best trial of each model. The CNN-ANN model integrates both convolutional and dense layers, with the specific architecture determined by the random search algorithm. This model may include various combinations of convolutional layers, MaxPooling layers, BatchNormalization layers, dropout layers, and dense layers, with the number and arrangement of these layers optimized by the algorithm. Additionally, hyperparameters such as the number of filters in each convolutional layer, number of neurons in each dense layer, activation functions, weight initializers, and learning rate were determined by the random search algorithm. Figures 3 and 4 provide visual representations of the hyperparameter tuning process for the CNN-ANN model using parallel coordinates and scatter plot matrix views, respectively. The pure CNN model, true to its name, comprises multiple convolutional layers followed by a dense classification layer for final output. The fine-tuning process for this model involved determining the optimal configuration of convolutional, MaxPooling, BatchNormalization, and dropout layers, guided by the random search algorithm. Additionally, the algorithm selected the number of filters for each convolutional layer, the activation function for each convolutional layer, and the weight initializer for each layer. Furthermore, the best learning rate value was determined by the random search algorithm. Figures 5 and 6 showcase parts of the parallel coordinates and scatter plot matrix views, respectively, illustrating the hyperparameter tuning process for the pure CNN model conducted by the random search algorithm. 
The third model consists solely of multiple dense layers without any convolutional layers. For the fine-tuning of this pure ANN model, we allowed for one or more dense layers and zero or more dropout layers, with the random search algorithm determining the optimal configuration of these layers. Additionally, the algorithm decided on the number of neurons in each dense layer, the activation function for each layer, and the weight initializer for each layer. Furthermore, the best learning rate value was determined by the random search algorithm. Figures 7 and 8 depict portions of the parallel coordinates and scatter plot matrix views, respectively, illustrating the hyperparameter tuning process for the pure ANN model conducted by the random search algorithm.

Fig. 2 Training accuracy and validation accuracy during the 5 epochs of each execution in the best trial conducted during the models’ tuning process

Fig. 3 A part of the parallel coordinate view of the random search algorithm's experiments during the tuning of the CNN-ANN model. The green line indicates the best model

Fig. 4 A part of the scatter plot matrix view of the random search algorithm's experiments during the tuning of the CNN-ANN model. The green point indicates the best model

Fig. 5 A part of the parallel coordinate view of the random search algorithm's experiments during the tuning of the pure CNN model. The green line indicates the best model

Fig. 6 A part of the scatter plot matrix view of the random search algorithm's experiments during the tuning of the pure CNN model. The green point indicates the best model

Fig. 7 A part of the parallel coordinate view of the random search algorithm's experiments during the tuning of the pure ANN model. The green line indicates the best model

Fig. 8 A part of the scatter plot matrix view of the random search algorithm's experiments during the tuning of the pure ANN model. The green point indicates the best model

3.5.2 VoteDroid final structure

In the preceding section, three hyperparameter fine-tuning processes were conducted using the random search algorithm to select three distinct deep learning models for classifying Android applications as malicious or benign. The first model, CNN-ANN, comprises two phases: the first combines convolutional, MaxPooling, BatchNormalization, and dropout layers, while the second consists of multiple Dense layers and dropout layers. The hyperparameters were fine-tuned, and the optimal values were selected to enhance the model’s ability to detect malicious behavior accurately. The resulting structure and parameters of the CNN-ANN model are outlined in Table 4. The CNN-ANN model chosen through the hyperparameter tuning process consists of three convolutional layers, two dense layers, two MaxPooling layers, three dropout layers with a rate of 0.25, and one dropout layer with a rate of 0.3. The three convolutional layers use, respectively, 64 filters with relu activation and the uniform weight initializer, 224 filters with tanh activation and the glorot_uniform weight initializer, and 128 filters with relu activation and the uniform weight initializer.

Table 4 The structure of CNN-ANN model

The second tuned deep learning model is a pure CNN architecture with a single dense layer for classification. The optimal pure CNN model identified by the random search algorithm comprises three convolutional layers with 128, 32, and 32 filters, respectively, all employing relu activation functions. For kernel initialization, ‘lecun_uniform’ is used in the first convolutional layer, while ‘uniform’ is applied in the subsequent layers. Based on the optimization results, a BatchNormalization layer is placed after the first convolutional layer. Additionally, a dropout layer with a rate of 0.25 is used after each convolutional layer, and a Flatten layer then converts the feature map into a vector. A dropout layer with a rate of 0.3 is placed before the classification layer. The learning rate is set to 0.001. Detailed structural information for the tuned pure CNN model is provided in Table 5.

Table 5 The structure of the pure CNN model

The third tuned deep learning model was a pure ANN architecture. Hyperparameters of the ANN model, including the number of dense layers, neurons per layer, activation functions, kernel initializers, dropout layer presence, and learning rate, were fine-tuned using the random search algorithm. Following the random search process, the optimal hyperparameters were chosen for the proposed ANN model. The selected pure ANN model consists of two dense layers, with 64 neurons in the first layer and 32 neurons in the second layer. ‘glorot_normal’ is employed as the kernel initializer in the first dense layer, while ‘uniform’ is used in the second layer. Additionally, based on the results of the random search algorithm, ‘relu’ activation function is applied in both dense layers, and the learning rate is set to 0.01. Furthermore, a single dropout layer of 0.3 is positioned before the classification layer (the final layer in the model). The structural details of the proposed pure ANN model are outlined in Table 6.

Table 6 The structure of the pure ANN model

Initially, the tuned CNN-ANN, pure CNN, and pure ANN models were individually tested for classifying Android applications as benign or malware. Subsequently, we proposed merging these three models to construct a fine-tuned ensemble voting classifier. This ensemble classifier takes the predictions of the three deep learning models and makes the final decision in one of two modes: Malicious Minority Rule (MMR) and Label Majority Rule (LMR). In MMR mode, an application is labeled malicious if at least one model labels it as such; it is labeled benign only if all models label it as benign. In LMR mode, the label follows the majority prediction: an application is labeled benign if at least two of the three models classify it as benign; otherwise, it is labeled malware. This approach represents a novel hybridization and fine-tuning of deep learning models for malware detection. Figure 9 illustrates the schematic diagram of the proposed VoteDroid ensemble voting classifier. VoteDroid consists of multiple phases. In the first phase, Android APK archives are extracted, and the DEX code is converted into grayscale images to construct the image dataset for training. In the second phase, the random search algorithm is employed to fine-tune hyperparameters and select optimal values for constructing the deep learning models. This phase yields three distinct models: CNN-ANN, pure CNN, and pure ANN. These fine-tuned models are first tested individually and are then combined into an ensemble voting classifier to assess their collective performance in classifying Android applications.
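The two voting modes can be written down in a few lines. The sketch below assumes that each model outputs a hard label, encoded here as 1 for malicious and 0 for benign; the encoding and the function names are choices made for this illustration.

```python
MALICIOUS, BENIGN = 1, 0


def vote_mmr(predictions):
    """Malicious Minority Rule: a single malicious vote is enough."""
    return MALICIOUS if any(p == MALICIOUS for p in predictions) else BENIGN


def vote_lmr(predictions):
    """Label Majority Rule: the label chosen by most voters wins."""
    malicious_votes = sum(1 for p in predictions if p == MALICIOUS)
    return MALICIOUS if malicious_votes > len(predictions) / 2 else BENIGN


# Votes from the CNN-ANN, pure CNN, and pure ANN models for one application.
votes = [BENIGN, MALICIOUS, BENIGN]
mmr_label = vote_mmr(votes)  # 1: one malicious vote suffices
lmr_label = vote_lmr(votes)  # 0: two of three voters say benign
```

MMR trades a higher false-alarm rate for fewer missed malware samples, while LMR smooths out the errors of any single model; with three voters, an LMR tie cannot occur.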

Fig. 9 The proposed VoteDroid ensemble voting model

4 Experimental results

All experiments in this study were conducted using Google Colab’s free GPU online service and a server equipped with an Intel Xeon Silver 4314 CPU @ 2.40 GHz (32 processors) and 150 GB of RAM. Specifically, Google Colab’s GPU was utilized for hyperparameter fine-tuning and selecting the best models for the research task. Subsequently, training, testing, and the construction of the proposed ensemble model were carried out using the aforementioned server. The dataset was divided into 80% for training, 10% for testing, and 10% for validation of the proposed models.

Several experiments were carried out using the three fine-tuned models: CNN-ANN, pure CNN, and pure ANN. Initially, each model was individually trained and tested to evaluate its performance in detecting Android malicious code. All experiments were configured to run for 50 epochs with a batch size of 32. Additionally, an early stopping mechanism monitored the models’ loss and halted training if the loss value did not decrease for three consecutive epochs. Figure 10 depicts the variations in validation accuracy and loss throughout the training process for each model. Next, the three fine-tuned and trained deep learning models were combined into a single ensemble voting classifier for classifying Android applications as benign or malware. The voting classifier operated in two modes: Malicious Minority Rule (MMR) and Label Majority Rule (LMR). In MMR mode, an application was labeled benign only if all voter models agreed; otherwise, it was labeled as malware. In LMR mode, an application was assigned a class label if at least two voter models agreed on that label. Classification accuracy, precision, recall, and F1-score were used as evaluation metrics. Table 7 presents the results obtained by testing the individual fine-tuned models in standalone mode and by using the ensemble voting classifier. The results show that all proposed models performed well, with accuracy exceeding 97%. The ensemble voting model yielded the best result, reaching 98% accuracy in LMR mode, and achieved 97.33% accuracy in MMR mode.
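The early stopping rule mentioned above can be sketched as follows. The sketch assumes that "no decrease" is measured against the best loss seen so far and that the patience counter resets on every improvement; the text does not specify these details, and the loss values are invented for the example.

```python
def train_with_early_stopping(epoch_losses, max_epochs=50, patience=3):
    """Return (epochs run, best loss) under a patience-based stopping rule.

    `epoch_losses` stands in for the loss measured after each epoch.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in epoch_losses[:max_epochs]:
        epochs_run += 1
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # three epochs in a row without a lower loss
    return epochs_run, best_loss


# Invented loss curve: the loss stalls after the third epoch.
losses = [0.60, 0.41, 0.35, 0.36, 0.35, 0.37, 0.30]
epochs_run, best_loss = train_with_early_stopping(losses)
# epochs_run == 6, best_loss == 0.35
```

Training stops after the sixth epoch here, so the later improvement to 0.30 is never reached; this is the usual trade-off of a short patience window.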
The standalone accuracies of the individual voter models were also high, with the CNN-ANN, CNN, and ANN models achieving classification accuracies of 97.67%, 97.33%, and 97.67%, respectively. Moreover, Table 7 highlights the efficiency of both the individual submodels of VoteDroid and the VoteDroid model itself in terms of detection time. Specifically, the detection time per sample for the CNN-ANN model, pure CNN model, pure ANN model, and the ensemble voting model was 0.0067, 0.0067, 0.0008, and 0.0142 s, respectively. Additionally, Fig. 11 displays the confusion matrices of the CNN-ANN model, pure CNN model, pure ANN model, and the ensemble voting model. These confusion matrices show that the CNN-ANN model failed to detect three malicious applications and misclassified eleven benign applications as malicious. One potential reason for these misclassifications is the complexity of the Android malware patterns captured by the visual features extracted from the converted source code. It is possible that certain subtle variations in the visual representations of benign and malicious applications were not effectively captured by the model’s architecture. Similarly, the pure CNN model missed four malicious applications and misclassified twelve benign applications. One explanation could be the limited ability of the CNN architecture to capture higher-level abstract features from the converted source code images, so that the model struggled to distinguish between benign and malicious patterns encoded in the visual representations. The pure ANN model, on the other hand, missed six malicious applications but reduced the number of misclassified benign applications to eight. This difference could be attributed to the inherent limitations of traditional ANN architectures in handling high-dimensional visual data: the model may have struggled to generalize to the complex and diverse visual patterns present in the source code images.
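The two voting modes defined above, MMR and LMR, can be sketched as follows. This is a minimal illustration under stated assumptions: the label encoding (`BENIGN = 0`, `MALWARE = 1`) and the function names are hypothetical, and each voter is assumed to contribute one hard label per application.

```python
BENIGN, MALWARE = 0, 1  # label encoding is an assumption for illustration


def vote_mmr(votes):
    """Malicious Minority Rule: benign only if every voter says benign."""
    return MALWARE if any(v == MALWARE for v in votes) else BENIGN


def vote_lmr(votes):
    """Label Majority Rule: the label chosen by a strict majority of voters.

    With three voters this means at least two of them agree on the label.
    """
    malware_votes = sum(1 for v in votes if v == MALWARE)
    return MALWARE if 2 * malware_votes > len(votes) else BENIGN


# One voter flags the application, the other two do not.
votes = (BENIGN, MALWARE, BENIGN)
print(vote_mmr(votes), vote_lmr(votes))
```

The example shows where the two modes diverge: a single malware vote is enough to flag the application under MMR, whereas LMR follows the two benign votes. This is why MMR trades additional false positives for fewer missed malware samples.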

Fig. 10 The changes in the accuracy and loss during training the proposed models

Table 7 The detection results obtained using the three tuned deep learning models in standalone and ensemble modes

Fig. 11 The confusion matrices of the proposed fine-tuned models

When the fine-tuned ensemble voting classifier was tested in MMR mode, malware detection improved, with only one malware application going undetected, albeit at the expense of more false positives, which rose to fifteen applications. The additional false positives in this mode likely stem from the way the individual decisions are combined: because a malware vote from any single voter model suffices to flag an application, the ensemble detects more malware but is also more sensitive to certain benign application patterns, resulting in a higher false positive rate. However, given that the primary concern is detecting malware, this trade-off is acceptable. In LMR mode, the ensemble classifier missed five malicious applications and incorrectly classified nine benign applications as malicious. Analysis of the false negative and false positive rates of the proposed VoteDroid model shows that using VoteDroid in MMR mode significantly reduces the false negative rate. Specifically, only one malware sample was incorrectly classified as benign, resulting in a false negative rate of just 0.0031. These results are promising, indicating that VoteDroid can detect malware effectively in real-world scenarios, with both high accuracy and short detection time. Figure 12 provides examples of detections conducted using the proposed VoteDroid model.
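The evaluation metrics and error rates discussed above all follow from the four confusion-matrix counts, with malware treated as the positive class. The sketch below shows the standard definitions; the function name `confusion_metrics` is hypothetical, and the counts in the usage example are placeholders chosen for illustration, not results from this study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp: malware detected as malware    fn: malware labeled benign
    tn: benign labeled benign          fp: benign labeled malware
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Placeholder counts for illustration only (not the paper's results).
metrics = confusion_metrics(tp=95, fp=10, tn=90, fn=5)
print(metrics)
```

Note that the false negative rate equals one minus recall, so lowering it, as MMR mode does, is equivalent to raising recall on the malware class.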

Fig. 12 Some detection examples achieved using the proposed VoteDroid model

5 Comparison study

We compared our proposed VoteDroid model with several previous works in the field of Android malware detection, as summarized in Table 8. VoteDroid achieved an accuracy of 98.0%, outperforming most of the compared models. However, one model in the comparison achieved a slightly higher accuracy of 98.4%. Upon further investigation, we found that this model benefitted from a substantially larger dataset comprising 400,000 applications, compared to our dataset of 6,000 apps. Additionally, this model employed a dynamic analysis approach, which likely contributed to its higher accuracy compared to our visualization-based approach. While dynamic analysis can provide deeper insights into an application’s behavior at runtime, it often requires extensive computational resources and may not scale well to large datasets. Furthermore, dynamic analysis may struggle to capture malware that exhibits stealthy or dormant behavior, which could limit its effectiveness in certain scenarios.

Table 8 Performance comparison study

6 Conclusions and future works

The landscape of smartphone device security, particularly within the Android ecosystem, presents significant challenges, prompting extensive research efforts in both academic and business realms. Despite the surge in research activities since 2010, the pace of advancements does not adequately match the escalating volume of daily malicious payloads.

This study introduces VoteDroid, a novel ensemble voting detector built from multiple deep learning models that are fine-tuned using an optimization algorithm. Using the random search algorithm, we fine-tuned three distinct models (CNN-ANN, pure CNN, and pure ANN), optimizing each model’s architecture and hyperparameters for enhanced performance. The proposed ensemble model combines these refined models and shows promising results, achieving an accuracy of 98% in LMR mode and 97.33% in MMR mode. The standalone accuracies of the individual voter models closely approximate that of the ensemble model, highlighting their efficacy. Analysis of the confusion matrices shows that the ensemble voting classifier excels in detecting malicious behavior: in MMR mode it identifies all but one of the malicious applications, at the cost of a higher number of false positives in the benign class. In general, the misclassifications observed across the models could be attributed to various factors, such as the inherent complexity and variability of Android malware, the diversity of benign application behaviors, and the challenges associated with accurately representing the semantic meaning of source code through visual features. Additionally, the effectiveness of the models may have been influenced by the size and diversity of the dataset, as well as the specific architectural choices and hyperparameter configurations employed during training.

Moreover, the proposed VoteDroid model achieved a low false negative rate of just 0.0031 in MMR mode, signifying its ability to identify malicious applications reliably. This result is promising for real-world applications, since reducing false negatives is paramount in malware detection: overlooking even a single malicious application can have detrimental consequences. By keeping false negatives to such a small proportion, VoteDroid demonstrates its capacity to identify potential threats and thereby enhance the security of Android devices. In addition to its accuracy, VoteDroid is efficient in terms of detection time, which further supports its suitability for real-world deployment. The combination of high accuracy and short detection times positions VoteDroid as a robust solution for addressing the evolving landscape of cybersecurity threats in Android environments.

Looking ahead, future research endeavors will explore the applicability of our approach to diverse malware types and platforms. Additionally, efforts will focus on leveraging larger datasets to enhance model training and investigating the integration of deep learning models for feature extraction in classical machine learning algorithms.