Skip to main content
Log in

A software classification scheme using binary-level characteristics for efficient software filtering

  • Methodologies and Application
  • Published:
Soft Computing Aims and scope Submit manuscript

Abstract

Software filtering systems can be employed to detect and filter out pirated or counterfeit software on the Web sites and peer-to-peer networks. They determine whether a suspicious program is legal or not by comparing it with original programs in a database or in the market. To identify pirated or counterfeit software, software filtering systems need to measure software similarity when comparing a suspicious program with original ones. In this case, the comparison overhead might be very high because the suspicious program is compared with all programs in the database or market in the worst case. This paper proposes a software classification scheme for efficient software filtering systems. The scheme focuses specifically on the Windows portable executable files which have been prime targets for software pirates. The scheme extracts software characteristics from a suspicious program and classifies it into one of pre-defined categories quickly based on the characteristics. The suspicious program is compared only with the programs in the one of pre-defined categories in most cases; thus, the comparison overhead is reduced. We propose two classification methods. The first one extracts strings from GUI-related resources of a program and computes the relevance of the program to each category based on the pre-computed score of the strings. The second one extracts API call frequency from a program’s execution codes and uses Random Forest technique to classify the program. Experimental results show that the proposed scheme can classify programs effectively and can reduce the comparison overhead significantly.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

References

  • Bayer U, Comparetti PM, Hlauschek C, Kruegel C, Kirda E (2009) Scalable, behavior-based malware clustering. In: Proceedings of symposium on network and distributed system security (NDSS). The Internet Society, Feb 2009

  • Bergeron J, Debbabi M, Desharnais J, Erhioui MM, Lavoie Y, Tawbi N et al (2001) Static detection of malicious code in executable programs. Int J Requir Eng 2001(184–189):79

    Google Scholar 

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

    Article  MATH  Google Scholar 

  • Cadenas JM, Garrido MC, Martínez R, Bonissone PP (2012) Extending information processing in a fuzzy random forest ensemble. Soft Comput 16(5):845–861

    Article  Google Scholar 

  • Chan PPF, Hui LCK, Yiu SM (2013) Heap graph based software theft detection. IEEE Trans Inf Forensics Secur 8(1):101–110

    Article  Google Scholar 

  • Chen Y-W, Wang J-L, Cai Y-Q, Ji-Xiang D (2015) A method for Chinese text classification based on apparent semantics and latent aspects. J Ambient Intell Humaniz Comput 6(4):473–480

    Article  Google Scholar 

  • Dalla Preda M, Christodorescu M, Jha S, Debray S (2008) A semantics-based approach to malware detection. ACM Trans Program Lang Syst 30(5):1–54

    Article  MATH  Google Scholar 

  • Firdausi I, Lim C, Erwin A, Nugroho AS (2010) Analysis of machine learning techniques used in behavior-based malware detection. In: 2010 second international conference on advances in computing, control and telecommunication technologies (ACT). IEEE, pp 201–203

  • Gandotra E, Bansal D, Sofat S (2014) Malware analysis and classification: a survey. J Inf Secur 5:56–64

  • Gantz JF, Vavra T, Howard J, Rodolfo R, Lee R, Satidkanitkul A, Taori HN, Sharma R, Villate R, Florean A et al (2013) The dangerous world of counterfeit and pirated software. IDC White Paper

  • Gantz JF, Florean A, Lee R, Lim V, Sikdar B, Lakshmi SKS, Madhavan L, Nagappan M (2014) The link between pirated software and cyber security breaches. IDC White Paper

  • Gupta DL, Malviya AK, Singh S (2012) Performance analysis of classification tree learning algorithms. Int J Comput Appl 55(6) 39–44

  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten HI (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18

  • Han KS, Kang B, Im EG (2011) Malware classification using instruction frequencies. In: Proceedings of the 2011 ACM symposium on research in applied computation (RACS). ACM, pp 298–300

  • Jang M, Kim D (2013) Filtering illegal android application based on feature information. In: Proceedings of the 2013 research in adaptive and convergent systems. ACM, pp 357–358

  • Kang SW, Shim H, Cho S, Park M, Han S (2014) A robust and efficient birthmark-based android application filtering system. In: Proceedings of the 2014 conference on research in adaptive and convergent systems. ACM, pp 253–257

  • Kawaguchi S, Garg PK, Matsushita M, Inoue K (2006) Mudablue: an automatic categorization system for open source repositories. J Syst Softw 79(7):939–953

    Article  Google Scholar 

  • Keim DA, Oelke D, Rohrdantz C (2009) Analyzing document collections via context-aware term extraction. Springer, Berlin

    Google Scholar 

  • Kim Y, Park J, Cho S, Nah Y, Han S, Park M (2015) Machine learning-based software classification scheme for efficient program similarity analysis. In: Proceedings of the 2015 conference on research in adaptive and convergent systems. ACM, pp 114–118

  • Kim D, Kim Y, Cho S, Park M, Han S, Lee G, Hwang Y (2016) An effective and intelligent windows application filtering system using software similarity. Soft Comput 20(5):1821–1827

    Article  Google Scholar 

  • Kolter JZ, Maloof MA (2006) Learning to detect and classify malicious executables in the wild. J Mach Learn Res 7:2721–2744

    MathSciNet  MATH  Google Scholar 

  • Lanzi A, Sharif M, Lee W (2009) K-tracer: a system for extracting kernel malware behavior. In: Proceedings of symposium on network and distributed system security (NDSS). The Internet Society, Feb 2009

  • Lee T, Mody JJ (2006) Behavioral classification. In: Proceedings of annual conference of the European Institute for Computer Antivirus Research (EICAR), pp 1–17, Apr 2006

  • Linn C, Debray S (2003) Obfuscation of executable code to improve resistance to static disassembly. In: Proceedings of the 10th ACM conference on computer and communications security. ACM, pp 290–299

  • Litvak M, Last M, Kandel A (2013) Degext: a language-independent keyphrase extractor. J Ambient Intell Humaniz Comput 4(3) 377–387

  • McMillan C, Linares-Vasquez M, Poshyvanyk D, Grechanik M (2011) Categorizing software applications for maintenance. In: Proceedings of the 27th IEEE international conference on software maintenance (ICSM 2011), Williamsburg, VA, USA, pp 343–352. IEEE, Sept 2011

  • Moser A, Kruegel C, Kirda E (2007) Limits of static analysis for malware detection. In: Twenty-third annual computer security applications conference, 2007. ACSAC 2007. IEEE, pp 421–430

  • Narudin FA, Feizollah A, Anuar NB, Gani A (2016) Evaluation of machine learning classifiers for mobile malware detection. Soft Comput 20(1):343–357

    Article  Google Scholar 

  • Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random forest? In: MLDM. Springer, pp 154–168

  • Palmien F, Fiore U, Castiglionec A, De Santis A (2013) On the detection of card-sharing traffic through wavelet analysis and support vector machines. Appl Soft Comput 13(1):615–627

    Article  Google Scholar 

  • Rieck K, Holz T, Willems C, Düssel P, Laskov P (2008) Learning and classification of malware behavior. In: Proceedings of conference on detection of intrusions and malware, and vulnerability assessment (DIMVA). Springer, pp 108–125

  • Rieck K, Trinius P, Willems C, Holz T (2011) Automatic analysis of malware behavior using machine learning. J Comput Secur 19(4):639–668

    Article  Google Scholar 

  • Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

    Article  MATH  Google Scholar 

  • Schultz MG, Eskin E, Zadok E, Stolfo SJ (2001) Data mining methods for detection of new malicious executables. In: Proceedings of IEEE symposium on security and privacy. IEEE, pp 38–49, May 2001

  • SourceForge. http://sourceforge.net

  • Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21

    Article  Google Scholar 

  • Suh GE, Lee JW, Zhang D, Devadas S (2004) Secure program execution via dynamic information flow tracking. In: ACM Sigplan Notices, vol 39. ACM, pp 85–96

  • Takçı H, Güngör T (2012) A high performance centroid-based classification approach for language identification. Pattern Recognit Lett 33(16):2077–2084

    Article  Google Scholar 

  • Tian K, Revelle M, Poshyvanyk D (2009) Using latent Dirichlet allocation for automatic categorization of software. In: Proceedings of the 6th IEEE international working conference on mining software repositories (MSR’09), Vancouver, Canada. IEEE, pp 163–166, May

  • Ugurel S, Krovetz R, Giles CL (2002) What’s the code? Automatic classification of source code archives. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 632–638

  • Wang D, Zhang H (2013) Inverse-category-frequency based supervised term weighting schemes for text categorization. J Inf Sci Eng 29(2):209–225

    Google Scholar 

  • Willems C, Holz T, Freiling F (2007) Cwsandbox: towards automated dynamic binary analysis. IEEE Secur Priv 5(2):32–39

    Article  Google Scholar 

  • Yang C-Z, Tu M-H (2012) Lacta: an enhanced automatic software categorization on the native code of android applications. In: Proceedings of the international multiconference of engineers and computer scientists (IMECS 2012), vol 1, Hong Kong, Mar 2012

Download references

Acknowledgments

The present research was funded by (1) Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2015R1D1A1A02061946) and (2) the research fund of Dankook University (BK21 Plus) in 2014.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Seong-je Cho.

Ethics declarations

Conflict of interest

The authors declare there is no conflict of interests regarding the publication of this paper.

Additional information

Communicated by V. Loia.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kim, Y., Cho, Sj., Han, S. et al. A software classification scheme using binary-level characteristics for efficient software filtering. Soft Comput 22, 595–606 (2018). https://doi.org/10.1007/s00500-016-2357-x

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00500-016-2357-x

Keywords

Navigation