Abstract
Malware applications are a serious threat to Android devices, users, developers, and application stores. Researchers continue to seek new detection methods because malware has grown more complex, changes continuously, and causes increasing damage. One of the most important challenges in malware detection is obtaining a balanced dataset. This paper proposes a detection method that improves accuracy and reduces error rates by preprocessing and balancing the dataset. Static analysis is used to extract features from the applications. Feature ranking methods are then used to preprocess the feature set, and features with low effectiveness are removed. The proposed method balances the dataset using undersampling, the Synthetic Minority Oversampling Technique (SMOTE), and a combination of the two, an approach that has not yet been studied among detection methods. The K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Iterative Dichotomiser 3 (ID3) classifiers are then used to build the detection model. KNN with SMOTE performs better than the other classifiers. The results show that precision, recall, accuracy, F-measure, and Matthews Correlation Coefficient all exceed 97%. The proposed method is effective in detecting 99.49% of the malware, both in the dataset used and among new malware.
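To make the balancing step concrete, the following is a minimal sketch of SMOTE-style oversampling followed by a KNN vote and the Matthews Correlation Coefficient, written in plain Python. It is not the authors' implementation: the toy feature vectors, the choice of malware as the minority class, the function names, and the values of k are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(zip(train_x, train_y), key=lambda s: euclidean(s[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy data (hypothetical): label 0 = benign (majority), label 1 = malware (minority).
benign = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.4], [0.4, 0.1]]
malware = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]

# Oversample the minority class until both classes have the same size.
synthetic = smote(malware, n_new=len(benign) - len(malware))
train_x = benign + malware + synthetic
train_y = [0] * len(benign) + [1] * (len(malware) + len(synthetic))

print(knn_predict(train_x, train_y, [0.95, 1.0]))  # query near the malware cluster
print(knn_predict(train_x, train_y, [0.1, 0.1]))   # query near the benign cluster
```

Each synthetic sample lies on the line segment between two real minority samples, so oversampling adds variety rather than exact duplicates. In practice, a maintained library implementation would normally be preferred over a hand-written one.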
Cite this article
Dehkordy, D.T., Rasoolzadegan, A. A new machine learning-based method for android malware detection on imbalanced dataset. Multimed Tools Appl 80, 24533–24554 (2021). https://doi.org/10.1007/s11042-021-10647-z
DOI: https://doi.org/10.1007/s11042-021-10647-z