A new machine learning-based method for android malware detection on imbalanced dataset

Multimedia Tools and Applications

Abstract

Malicious applications are a serious threat to Android devices, users, developers, and application stores. Because malware keeps growing in complexity, changes continuously, and causes increasing damage, researchers continue to look for new detection methods. One of the most important challenges in malware detection is obtaining a balanced dataset. This paper proposes a detection method that improves accuracy and reduces error rates by preprocessing and balancing the dataset. Static analysis is used to extract features from the applications. Feature-ranking methods are then applied to the feature set, and the least effective features are removed. The proposed method balances the dataset using undersampling, the Synthetic Minority Oversampling Technique (SMOTE), and a combination of the two, which has not yet been studied among detection methods. The K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Iterative Dichotomiser 3 (ID3) classifiers are then used to build the detection model. KNN combined with SMOTE performs better than the other classifiers. The results show that precision, recall, accuracy, F-measure, and the Matthews Correlation Coefficient all exceed 97%. The proposed method is effective in detecting 99.49% of the malware existing in the dataset used and new malware.
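The three balancing strategies named in the abstract (undersampling, SMOTE, and their combination) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote`, `undersample`, and `balance`, the `target` class size, and the use of Euclidean distance on plain Python lists are all assumptions made for illustration, and the SMOTE step is a simplified version of the interpolation idea rather than a full reproduction of the published algorithm.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (simplified SMOTE).

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x inside the minority class, excluding x
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def undersample(majority, n_keep, seed=0):
    """Randomly keep n_keep majority samples (random undersampling)."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


def balance(majority, minority, target, k=5, seed=0):
    """Combined strategy: shrink the majority class and grow the
    minority class until both contain `target` samples."""
    kept = undersample(majority, min(target, len(majority)), seed)
    extra = smote(minority, max(0, target - len(minority)), k, seed)
    return kept, minority + extra


# Toy feature vectors (hypothetical data): 20 majority vs. 4 minority samples
majority = [[float(i), float(i % 3)] for i in range(20)]
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kept, grown = balance(majority, minority, target=10, k=3)
print(len(kept), len(grown))
```

Setting `target` to the majority size gives pure oversampling, setting it to the minority size gives pure undersampling, and any value in between combines the two. Synthetic samples should be generated only from training data, so that the test set contains no interpolated points.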

Notes

  1. https://www.sec.cs.tu-bs.de/~danarp/drebin/index.html

  2. http://amd.arguslab.org

Author information

Corresponding author

Correspondence to Abbas Rasoolzadegan.

About this article

Cite this article

Dehkordy, D.T., Rasoolzadegan, A. A new machine learning-based method for android malware detection on imbalanced dataset. Multimed Tools Appl 80, 24533–24554 (2021). https://doi.org/10.1007/s11042-021-10647-z

  • DOI: https://doi.org/10.1007/s11042-021-10647-z
