Advances in Digital Forensics VII

Volume 361 of the series IFIP Advances in Information and Communication Technology pp 65-75

Fast Content-Based File Type Identification

  • Irfan Ahmed (Information Security Institute, Queensland University of Technology)
  • Kyung-Suk Lhee (Ajou University)
  • Hyun-Jung Shin (Ajou University)
  • Man-Pyo Hong (Ajou University)


Digital forensic examiners often need to identify the type of a file or file fragment from its content. Content-based file type identification schemes typically combine a byte frequency distribution with statistical machine learning to classify file types. Most algorithms analyze the entire file content to obtain the byte frequency distribution, which is inefficient and time-consuming. This paper proposes two techniques for reducing the classification time. The first selects a subset of features based on their frequency of occurrence. The second speeds up classification by randomly sampling file blocks. Experimental results demonstrate that up to a fifteen-fold reduction in computational time can be achieved with limited impact on accuracy.
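
To make the two ideas concrete, the following is a minimal Python sketch, not the authors' implementation. It estimates a byte frequency distribution from a few randomly sampled blocks instead of the whole file, then keeps only the most frequent byte values as features. The function names, the 512-byte block size, the number of sampled blocks, and the top-k selection rule are all assumptions made for illustration; the abstract does not specify them.

```python
import random
from collections import Counter
from typing import List, Optional


def sampled_byte_frequencies(data: bytes,
                             block_size: int = 512,
                             num_blocks: int = 8,
                             seed: Optional[int] = None) -> List[float]:
    """Estimate the normalised byte frequency distribution of `data`
    from randomly chosen blocks (hypothetical parameters)."""
    rng = random.Random(seed)
    # Number of blocks in the input, rounding up; at least one so that
    # sampling is well defined for empty input.
    total_blocks = max(1, (len(data) + block_size - 1) // block_size)
    # Sample block indices without replacement.
    chosen = rng.sample(range(total_blocks), min(num_blocks, total_blocks))

    counts: Counter = Counter()
    for index in chosen:
        start = index * block_size
        # Iterating over a bytes slice yields integers in 0..255.
        counts.update(data[start:start + block_size])

    total = sum(counts.values())
    if total == 0:
        return [0.0] * 256
    return [counts[value] / total for value in range(256)]


def select_top_features(freqs: List[float], k: int) -> List[int]:
    """Return the k byte values with the highest estimated frequency.
    Top-k is an assumed selection rule, used here only as an example."""
    return sorted(range(256), key=lambda value: freqs[value], reverse=True)[:k]
```

In a full pipeline, the selected frequencies would be passed to a classifier as the feature vector. Reading fewer blocks and keeping fewer features both reduce the work per file, which is the kind of trade-off between speed and accuracy the abstract describes.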


Keywords: File type identification, file content classification, byte frequency