Abstract
Identifying the type of a piece of code, whether in a file or a byte stream, is a challenge that many software companies face. Many applications, in security and other domains, base their behavior on the type of code they receive as input.
Traditional identification methods rely on file extensions, magic numbers, proprietary headers and trailers, or type-specific identification rules. All of these are vulnerable to content tampering, and detecting such tampering requires long and tedious hours of expert work. This study aims to find a method for identifying the best settings for automatically creating type signatures that effectively overcome the content manipulation problem.
In this paper we lay out a framework for creating type signatures based on byte N-Grams. The framework allows setting various parameters, such as N-Gram sizes and windows, selecting statistical tests, and defining rules for score calculation. It serves as a test lab for finding the parameters that satisfy a predefined threshold of type identification accuracy. We demonstrate the framework using basic settings that achieved an F-Measure of 0.996 on 1400 test files.
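To make the idea of a byte N-Gram type signature concrete, the following is a minimal sketch, not the framework described in the paper. It assumes the simplest possible choices: a signature is the set of the most frequent N-Grams across sample files of a type, and a candidate is scored by the fraction of its N-Grams found in that signature. The function names (`byte_ngrams`, `build_signature`, `score`, `classify`, `f_measure`) and the parameters `n`, `window`, and `top_k` are hypothetical and chosen for illustration only; the raw-frequency ranking stands in for whatever statistical test is actually selected.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def byte_ngrams(data: bytes, n: int, window: Optional[int] = None) -> Counter:
    """Count the byte n-grams in `data`, optionally limited to the first `window` bytes."""
    if window is not None:
        data = data[:window]
    # range() is empty when the data is shorter than n, so the result is an empty Counter.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_signature(samples: List[bytes], n: int = 2, top_k: int = 50) -> Set[bytes]:
    """Assumed signature rule: keep the top_k most frequent n-grams over all samples."""
    total: Counter = Counter()
    for sample in samples:
        total.update(byte_ngrams(sample, n))
    return {gram for gram, _ in total.most_common(top_k)}


def score(data: bytes, signature: Set[bytes], n: int = 2) -> float:
    """Assumed score rule: fraction of the n-grams in `data` that belong to the signature."""
    grams = byte_ngrams(data, n)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    hits = sum(count for gram, count in grams.items() if gram in signature)
    return hits / total


def classify(data: bytes, signatures: Dict[str, Set[bytes]], n: int = 2) -> str:
    """Return the type whose signature gives the highest score."""
    return max(signatures, key=lambda type_name: score(data, signatures[type_name], n))


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-Measure: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In this sketch, varying `n`, `window`, and `top_k`, or replacing the ranking and scoring rules, corresponds loosely to the kind of parameter exploration the abstract describes; the actual tests and rules used in the study are not specified here.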
Copyright information
© 2012 IFIP International Federation for Information Processing
About this paper
Cite this paper
Sharon, R., Gudes, E. (2012). Code Type Revealing Using Experiments Framework. In: Cuppens-Boulahia, N., Cuppens, F., Garcia-Alfaro, J. (eds) Data and Applications Security and Privacy XXVI. DBSec 2012. Lecture Notes in Computer Science, vol 7371. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31540-4_15
DOI: https://doi.org/10.1007/978-3-642-31540-4_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31539-8
Online ISBN: 978-3-642-31540-4
eBook Packages: Computer Science (R0)