Partitioning Based N-Gram Feature Selection for Malware Classification

Hu, Weiwei; Tan, Ying

doi:10.1007/978-3-319-40973-3_18

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9714)

Included in the following conference series:

International Conference on Data Mining and Big Data

Abstract

Byte-level N-Grams are among the most widely used features for malware classification because of their good performance and robustness. However, N-Gram feature selection on a large dataset consumes a great deal of time and space because of the large number of distinct N-Grams. This paper proposes a partitioning-based algorithm for large-scale feature selection that decomposes the original problem into subproblems that can be solved in memory, without heavy I/O load. The partitioning step uses an efficient implementation to convert the original dataset, whose parts interact with one another, into mutually independent data partitions. This independence makes the in-memory solutions effective and allows the partitions to be processed in parallel. The proposed algorithm was implemented on Apache Spark, and experimental results show that it selects features in a very short time, nearly three times faster than the MapReduce approach used for comparison.
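The idea in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation. The hash partitioner, the information-gain criterion, and all function names below (`byte_ngrams`, `partition_of`, `select_features`) are assumptions chosen for illustration, since the abstract does not specify them. The sketch only shows why independent partitions allow in-memory, parallelizable counting: each N-Gram belongs to exactly one partition, so its statistics never depend on another partition.

```python
import heapq
import math
import zlib
from collections import defaultdict


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the distinct byte-level n-grams that occur in one file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def partition_of(gram: bytes, num_partitions: int) -> int:
    """Stable hash partitioner (an assumption; the paper's scheme may differ)."""
    return zlib.crc32(gram) % num_partitions


def entropy(pos: int, neg: int) -> float:
    """Binary entropy, in bits, of a (pos, neg) count pair."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h


def information_gain(df_pos: int, df_neg: int, n_pos: int, n_neg: int) -> float:
    """Information gain of 'file contains the n-gram' about the class label."""
    total = n_pos + n_neg
    present = df_pos + df_neg
    absent = total - present
    conditional = ((present / total) * entropy(df_pos, df_neg)
                   + (absent / total) * entropy(n_pos - df_pos, n_neg - df_neg))
    return entropy(n_pos, n_neg) - conditional


def select_features(samples, n=4, num_partitions=8, top_k=10):
    """samples: list of (file_bytes, label), label 1 = malware, 0 = benign."""
    n_pos = sum(1 for _, y in samples if y == 1)
    n_neg = len(samples) - n_pos
    candidates = []
    # Each pass handles one partition and shares no state with the others,
    # so the passes could run in parallel. For clarity the files are
    # re-scanned on every pass; a real system would shuffle the data once.
    for p in range(num_partitions):
        counts = defaultdict(lambda: [0, 0])  # gram -> [df_benign, df_malware]
        for data, y in samples:
            for g in byte_ngrams(data, n):
                if partition_of(g, num_partitions) == p:
                    counts[g][y] += 1
        scored = ((information_gain(c[1], c[0], n_pos, n_neg), g)
                  for g, c in counts.items())
        # Keeping the local top_k is enough: any global top_k n-gram
        # is also among the top_k of its own partition.
        candidates.extend(heapq.nlargest(top_k, scored))
    return [g for _, g in heapq.nlargest(top_k, candidates)]
```

For example, given two malware samples that both contain the bytes `EVIL` and two benign samples that do not, `select_features(samples, n=4, top_k=1)` ranks `b"EVIL"` first. Apart from tie-breaking among equally scored N-Grams, the result does not depend on `num_partitions`, which is the property that makes the partitions independent.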

Notes

1. http://spark.apache.org/examples.html

Acknowledgments

This work was supported by the Natural Science Foundation of China (NSFC) under grant no. 61375119 and the Beijing Natural Science Foundation under grant no. 4162029, and partially supported by the National Key Basic Research Development Plan (973 Plan) Project of China under grant no. 2015CB352302.

Author information

Authors and Affiliations

Key Laboratory of Machine Perception (MOE), Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Weiwei Hu & Ying Tan

Corresponding author

Correspondence to Ying Tan.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Ying Tan
Xi'an Jiaotong-Liverpool University, Suzhou, China
Yuhui Shi

About this paper

Cite this paper

Hu, W., Tan, Y. (2016). Partitioning Based N-Gram Feature Selection for Malware Classification. In: Tan, Y., Shi, Y. (eds) Data Mining and Big Data. DMBD 2016. Lecture Notes in Computer Science, vol 9714. Springer, Cham. https://doi.org/10.1007/978-3-319-40973-3_18

DOI: https://doi.org/10.1007/978-3-319-40973-3_18
Published: 14 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40972-6
Online ISBN: 978-3-319-40973-3
eBook Packages: Computer Science (R0)