Abstract
Digital forensic examiners often need to identify the type of a file or file fragment based on the content of the file. Content-based file type identification schemes typically use a byte frequency distribution with statistical machine learning to classify file types. Most algorithms analyze the entire file content to obtain the byte frequency distribution, a technique that is inefficient and time consuming. This paper proposes two techniques for reducing the classification time. The first technique selects a subset of features based on the frequency of occurrence. The second speeds up classification by randomly sampling file blocks. Experimental results demonstrate that up to a fifteen-fold reduction in computational time can be achieved with limited impact on accuracy.
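The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the block size, the number of sampled blocks, and the reading of "frequency of occurrence" as "keep the byte values with the highest total frequency in the training data" are all assumptions made for illustration.

```python
import random


def sampled_byte_histogram(data, block_size=512, num_blocks=8, seed=0):
    """Normalized byte frequency distribution built from randomly sampled blocks.

    block_size, num_blocks and seed are illustrative defaults (assumptions),
    not values taken from the paper.
    """
    # Number of blocks in the file (ceiling division), at least one.
    total_blocks = max(1, -(-len(data) // block_size))
    rng = random.Random(seed)
    # Sample block indices without replacement; use every block for small files.
    chosen = rng.sample(range(total_blocks), min(num_blocks, total_blocks))

    counts = [0] * 256
    seen = 0
    for index in chosen:
        block = data[index * block_size:(index + 1) * block_size]
        for byte in block:
            counts[byte] += 1
        seen += len(block)

    if seen == 0:
        return [0.0] * 256
    return [count / seen for count in counts]


def select_top_features(histograms, k):
    """Indices of the k byte values with the highest summed frequency.

    Assumed interpretation of frequency-based feature selection.
    """
    totals = [sum(h[i] for h in histograms) for i in range(256)]
    ranked = sorted(range(256), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:k])


def project(histogram, features):
    """Reduce a 256-bin histogram to the selected feature indices."""
    return [histogram[i] for i in features]
```

A classifier would then be trained on the projected vectors rather than on all 256 bins, and each file to be classified would be read only at the sampled blocks; both steps reduce the work per file at some cost in accuracy.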
Copyright information
© 2011 IFIP International Federation for Information Processing
About this paper
Cite this paper
Ahmed, I., Lhee, KS., Shin, HJ., Hong, MP. (2011). Fast Content-Based File Type Identification. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics VII. DigitalForensics 2011. IFIP Advances in Information and Communication Technology, vol 361. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24212-0_5
DOI: https://doi.org/10.1007/978-3-642-24212-0_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24211-3
Online ISBN: 978-3-642-24212-0
eBook Packages: Computer Science, Computer Science (R0)