Abstract
Byte-level N-Gram extraction is one of the most widely used feature extraction methods for malware classification because of its good performance and robustness. However, N-Gram feature selection on a large dataset consumes a great deal of time and memory because of the large number of distinct N-Grams. This paper proposes a partitioning-based algorithm for large-scale feature selection that breaks the original problem into subproblems small enough to be solved in memory, without heavy I/O load. The partitioning step efficiently converts the original dataset, whose records are interdependent, into mutually independent data partitions. This independence is what makes the in-memory solutions effective and allows the partitions to be processed in parallel. The proposed algorithm was implemented on Apache Spark, and experimental results show that it selects features nearly three times faster than the MapReduce approach used for comparison.
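The partitioning idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' Spark implementation. It assumes a hash-based partitioner (CRC32 of the N-Gram) and information gain over document frequencies as the selection criterion, neither of which the abstract specifies. All function names and parameters (`byte_ngrams`, `partition_of`, `select_features`, `num_partitions`, `top_k`) are hypothetical. The point it shows is that once every distinct N-Gram is routed to exactly one partition, each partition can be scored without looking at any other.

```python
import heapq
import math
import zlib
from collections import defaultdict


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the distinct byte-level n-grams that occur in one file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def partition_of(gram: bytes, num_partitions: int) -> int:
    """Assumed partitioner: a stable hash, so an n-gram always lands in the same partition."""
    return zlib.crc32(gram) % num_partitions


def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a two-class count."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def info_gain(df_pos: int, df_neg: int, n_pos: int, n_neg: int) -> float:
    """Information gain of the binary feature 'n-gram occurs in the file'."""
    total = n_pos + n_neg
    present = df_pos + df_neg
    absent = total - present
    conditional = (present / total) * entropy(df_pos, df_neg) \
        + (absent / total) * entropy(n_pos - df_pos, n_neg - df_neg)
    return entropy(n_pos, n_neg) - conditional


def select_features(samples, n=4, num_partitions=8, top_k=10):
    """samples: iterable of (file_bytes, label), label 1 = malware, 0 = benign."""
    n_pos = n_neg = 0
    # One dictionary per partition: n-gram -> [malware doc count, benign doc count].
    partitions = [defaultdict(lambda: [0, 0]) for _ in range(num_partitions)]
    for data, label in samples:
        if label:
            n_pos += 1
        else:
            n_neg += 1
        for gram in byte_ngrams(data, n):
            counts = partitions[partition_of(gram, num_partitions)][gram]
            counts[0 if label else 1] += 1

    # Each partition is scored using only its own counts plus the two class
    # totals, so in a distributed setting this loop could run in parallel.
    candidates = []
    for part in partitions:
        scored = ((info_gain(p, q, n_pos, n_neg), gram) for gram, (p, q) in part.items())
        candidates.extend(heapq.nlargest(top_k, scored))
    # Merge the per-partition winners into the global top-k.
    return [gram for _, gram in heapq.nlargest(top_k, candidates)]
```

In this sketch all partitions live in one process, so it demonstrates only the independence property, not the memory savings. In a cluster setting each partition would be held and scored by a separate worker, with only the per-partition winners sent back for the final merge.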
Acknowledgments
This work was supported by the Natural Science Foundation of China (NSFC) under grant no. 61375119 and the Beijing Natural Science Foundation under grant no. 4162029, and partially supported by the National Key Basic Research Development Plan (973 Plan) Project of China under grant no. 2015CB352302.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Hu, W., Tan, Y. (2016). Partitioning Based N-Gram Feature Selection for Malware Classification. In: Tan, Y., Shi, Y. (eds) Data Mining and Big Data. DMBD 2016. Lecture Notes in Computer Science, vol 9714. Springer, Cham. https://doi.org/10.1007/978-3-319-40973-3_18
DOI: https://doi.org/10.1007/978-3-319-40973-3_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40972-6
Online ISBN: 978-3-319-40973-3
eBook Packages: Computer Science (R0)