Soft Computing

, Volume 22, Issue 2, pp 595–606 | Cite as

A software classification scheme using binary-level characteristics for efficient software filtering

  • Yesol Kim
  • Seong-je Cho
  • Sangchul Han
  • Ilsun You
Methodologies and Application


Software filtering systems can be employed to detect and filter out pirated or counterfeit software on the Web sites and peer-to-peer networks. They determine whether a suspicious program is legal or not by comparing it with original programs in a database or in the market. To identify pirated or counterfeit software, software filtering systems need to measure software similarity when comparing a suspicious program with original ones. In this case, the comparison overhead might be very high because the suspicious program is compared with all programs in the database or market in the worst case. This paper proposes a software classification scheme for efficient software filtering systems. The scheme focuses specifically on the Windows portable executable files which have been prime targets for software pirates. The scheme extracts software characteristics from a suspicious program and classifies it into one of pre-defined categories quickly based on the characteristics. The suspicious program is compared only with the programs in the one of pre-defined categories in most cases; thus, the comparison overhead is reduced. We propose two classification methods. The first one extracts strings from GUI-related resources of a program and computes the relevance of the program to each category based on the pre-computed score of the strings. The second one extracts API call frequency from a program’s execution codes and uses Random Forest technique to classify the program. Experimental results show that the proposed scheme can classify programs effectively and can reduce the comparison overhead significantly.


Software classification Software filtering Text mining Random Forest Portable executable 



The present research was funded by (1) Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2015R1D1A1A02061946) and (2) the research fund of Dankook University (BK21 Plus) in 2014.

Compliance with ethical standards

Conflict of interest

The authors declare there is no conflict of interests regarding the publication of this paper.


  1. Bayer U, Comparetti PM, Hlauschek C, Kruegel C, Kirda E (2009) Scalable, behavior-based malware clustering. In: Proceedings of symposium on network and distributed system security (NDSS). The Internet Society, Feb 2009Google Scholar
  2. Bergeron J, Debbabi M, Desharnais J, Erhioui MM, Lavoie Y, Tawbi N et al (2001) Static detection of malicious code in executable programs. Int J Requir Eng 2001(184–189):79Google Scholar
  3. Breiman L (2001) Random forests. Mach Learn 45(1):5–32CrossRefzbMATHGoogle Scholar
  4. Cadenas JM, Garrido MC, Martínez R, Bonissone PP (2012) Extending information processing in a fuzzy random forest ensemble. Soft Comput 16(5):845–861CrossRefGoogle Scholar
  5. Chan PPF, Hui LCK, Yiu SM (2013) Heap graph based software theft detection. IEEE Trans Inf Forensics Secur 8(1):101–110CrossRefGoogle Scholar
  6. Chen Y-W, Wang J-L, Cai Y-Q, Ji-Xiang D (2015) A method for Chinese text classification based on apparent semantics and latent aspects. J Ambient Intell Humaniz Comput 6(4):473–480CrossRefGoogle Scholar
  7. Dalla Preda M, Christodorescu M, Jha S, Debray S (2008) A semantics-based approach to malware detection. ACM Trans Program Lang Syst 30(5):1–54CrossRefzbMATHGoogle Scholar
  8. Firdausi I, Lim C, Erwin A, Nugroho AS (2010) Analysis of machine learning techniques used in behavior-based malware detection. In: 2010 second international conference on advances in computing, control and telecommunication technologies (ACT). IEEE, pp 201–203Google Scholar
  9. Gandotra E, Bansal D, Sofat S (2014) Malware analysis and classification: a survey. J Inf Secur 5:56–64Google Scholar
  10. Gantz JF, Vavra T, Howard J, Rodolfo R, Lee R, Satidkanitkul A, Taori HN, Sharma R, Villate R, Florean A et al (2013) The dangerous world of counterfeit and pirated software. IDC White PaperGoogle Scholar
  11. Gantz JF, Florean A, Lee R, Lim V, Sikdar B, Lakshmi SKS, Madhavan L, Nagappan M (2014) The link between pirated software and cyber security breaches. IDC White PaperGoogle Scholar
  12. Gupta DL, Malviya AK, Singh S (2012) Performance analysis of classification tree learning algorithms. Int J Comput Appl 55(6) 39–44Google Scholar
  13. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten HI (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18Google Scholar
  14. Han KS, Kang B, Im EG (2011) Malware classification using instruction frequencies. In: Proceedings of the 2011 ACM symposium on research in applied computation (RACS). ACM, pp 298–300Google Scholar
  15. Jang M, Kim D (2013) Filtering illegal android application based on feature information. In: Proceedings of the 2013 research in adaptive and convergent systems. ACM, pp 357–358Google Scholar
  16. Kang SW, Shim H, Cho S, Park M, Han S (2014) A robust and efficient birthmark-based android application filtering system. In: Proceedings of the 2014 conference on research in adaptive and convergent systems. ACM, pp 253–257Google Scholar
  17. Kawaguchi S, Garg PK, Matsushita M, Inoue K (2006) Mudablue: an automatic categorization system for open source repositories. J Syst Softw 79(7):939–953CrossRefGoogle Scholar
  18. Keim DA, Oelke D, Rohrdantz C (2009) Analyzing document collections via context-aware term extraction. Springer, BerlinGoogle Scholar
  19. Kim Y, Park J, Cho S, Nah Y, Han S, Park M (2015) Machine learning-based software classification scheme for efficient program similarity analysis. In: Proceedings of the 2015 conference on research in adaptive and convergent systems. ACM, pp 114–118Google Scholar
  20. Kim D, Kim Y, Cho S, Park M, Han S, Lee G, Hwang Y (2016) An effective and intelligent windows application filtering system using software similarity. Soft Comput 20(5):1821–1827CrossRefGoogle Scholar
  21. Kolter JZ, Maloof MA (2006) Learning to detect and classify malicious executables in the wild. J Mach Learn Res 7:2721–2744MathSciNetzbMATHGoogle Scholar
  22. Lanzi A, Sharif M, Lee W (2009) K-tracer: a system for extracting kernel malware behavior. In: Proceedings of symposium on network and distributed system security (NDSS). The Internet Society, Feb 2009Google Scholar
  23. Lee T, Mody JJ (2006) Behavioral classification. In: Proceedings of annual conference of the European Institute for Computer Antivirus Research (EICAR), pp 1–17, Apr 2006Google Scholar
  24. Linn C, Debray S (2003) Obfuscation of executable code to improve resistance to static disassembly. In: Proceedings of the 10th ACM conference on computer and communications security. ACM, pp 290–299Google Scholar
  25. Litvak M, Last M, Kandel A (2013) Degext: a language-independent keyphrase extractor. J Ambient Intell Humaniz Comput 4(3) 377–387Google Scholar
  26. McMillan C, Linares-Vasquez M, Poshyvanyk D, Grechanik M (2011) Categorizing software applications for maintenance. In: Proceedings of the 27th IEEE international conference on software maintenance (ICSM 2011), Williamsburg, VA, USA, pp 343–352. IEEE, Sept 2011Google Scholar
  27. Moser A, Kruegel C, Kirda E (2007) Limits of static analysis for malware detection. In: Twenty-third annual computer security applications conference, 2007. ACSAC 2007. IEEE, pp 421–430Google Scholar
  28. Narudin FA, Feizollah A, Anuar NB, Gani A (2016) Evaluation of machine learning classifiers for mobile malware detection. Soft Comput 20(1):343–357CrossRefGoogle Scholar
  29. Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random forest? In: MLDM. Springer, pp 154–168Google Scholar
  30. Palmien F, Fiore U, Castiglionec A, De Santis A (2013) On the detection of card-sharing traffic through wavelet analysis and support vector machines. Appl Soft Comput 13(1):615–627CrossRefGoogle Scholar
  31. Rieck K, Holz T, Willems C, Düssel P, Laskov P (2008) Learning and classification of malware behavior. In: Proceedings of conference on detection of intrusions and malware, and vulnerability assessment (DIMVA). Springer, pp 108–125Google Scholar
  32. Rieck K, Trinius P, Willems C, Holz T (2011) Automatic analysis of malware behavior using machine learning. J Comput Secur 19(4):639–668CrossRefGoogle Scholar
  33. Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620CrossRefzbMATHGoogle Scholar
  34. Schultz MG, Eskin E, Zadok E, Stolfo SJ (2001) Data mining methods for detection of new malicious executables. In: Proceedings of IEEE symposium on security and privacy. IEEE, pp 38–49, May 2001Google Scholar
  35. Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21CrossRefGoogle Scholar
  36. Suh GE, Lee JW, Zhang D, Devadas S (2004) Secure program execution via dynamic information flow tracking. In: ACM Sigplan Notices, vol 39. ACM, pp 85–96Google Scholar
  37. Takçı H, Güngör T (2012) A high performance centroid-based classification approach for language identification. Pattern Recognit Lett 33(16):2077–2084CrossRefGoogle Scholar
  38. Tian K, Revelle M, Poshyvanyk D (2009) Using latent Dirichlet allocation for automatic categorization of software. In: Proceedings of the 6th IEEE international working conference on mining software repositories (MSR’09), Vancouver, Canada. IEEE, pp 163–166, MayGoogle Scholar
  39. Ugurel S, Krovetz R, Giles CL (2002) What’s the code? Automatic classification of source code archives. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 632–638Google Scholar
  40. Wang D, Zhang H (2013) Inverse-category-frequency based supervised term weighting schemes for text categorization. J Inf Sci Eng 29(2):209–225Google Scholar
  41. Willems C, Holz T, Freiling F (2007) Cwsandbox: towards automated dynamic binary analysis. IEEE Secur Priv 5(2):32–39CrossRefGoogle Scholar
  42. Yang C-Z, Tu M-H (2012) Lacta: an enhanced automatic software categorization on the native code of android applications. In: Proceedings of the international multiconference of engineers and computer scientists (IMECS 2012), vol 1, Hong Kong, Mar 2012Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. 1.Department of Computer Science and EngineeringDankook UniversityYonginKorea
  2. 2.Department of Computer EngineeringKonkuk UniversityChungjuKorea
  3. 3.Department of Information SecuritySoonchunhyang UniversityAsanKorea

Personalised recommendations