Fast Content-Based File Type Identification

  • Irfan Ahmed
  • Kyung-Suk Lhee
  • Hyun-Jung Shin
  • Man-Pyo Hong
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 361)


Digital forensic examiners often need to identify the type of a file or file fragment based on the content of the file. Content-based file type identification schemes typically use a byte frequency distribution with statistical machine learning to classify file types. Most algorithms analyze the entire file content to obtain the byte frequency distribution, a technique that is inefficient and time consuming. This paper proposes two techniques for reducing the classification time. The first technique selects a subset of features based on the frequency of occurrence. The second speeds up classification by randomly sampling file blocks. Experimental results demonstrate that up to a fifteen-fold reduction in computational time can be achieved with limited impact on accuracy.
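The two speed-ups named in the abstract can be illustrated with a minimal sketch. The abstract gives no parameter values, so the block size, the number of sampled blocks, the feature count `k`, and every function name below are assumptions made for illustration, not the authors' implementation. The sketch estimates a byte frequency distribution from randomly sampled blocks instead of the whole file, then keeps only the byte values that occur most frequently in a set of training vectors. The classifier itself is left out.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


def sampled_byte_frequencies(data: bytes, block_size: int = 512,
                             num_blocks: int = 8,
                             seed: Optional[int] = None) -> List[float]:
    """Estimate the 256-bin byte frequency distribution from random blocks.

    block_size and num_blocks are illustrative defaults (assumptions).
    """
    total_blocks = max(1, (len(data) + block_size - 1) // block_size)
    rng = random.Random(seed)
    # Sample block indices without replacement; use all blocks if the file is small.
    chosen = rng.sample(range(total_blocks), min(num_blocks, total_blocks))
    counts: Counter = Counter()
    for index in chosen:
        counts.update(data[index * block_size:(index + 1) * block_size])
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 256
    return [counts[b] / total for b in range(256)]


def select_frequent_bytes(training_vectors: Sequence[Sequence[float]],
                          k: int = 32) -> List[int]:
    """Return the k byte values with the highest mean frequency (assumed criterion)."""
    n = len(training_vectors)
    mean = [sum(v[b] for v in training_vectors) / n for b in range(256)]
    return sorted(range(256), key=lambda b: mean[b], reverse=True)[:k]


def reduce_features(vector: Sequence[float], selected: Sequence[int]) -> List[float]:
    """Project a 256-bin vector onto the selected byte values."""
    return [vector[b] for b in selected]
```

A reduced vector produced by `reduce_features` would then be passed to whichever classifier is in use. Sampling fewer blocks or choosing a smaller `k` trades accuracy for speed, which is the trade-off the abstract reports.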


File type identification, file content classification, byte frequency



Copyright information

© IFIP International Federation for Information Processing 2011

Authors and Affiliations

  • Irfan Ahmed (1)
  • Kyung-Suk Lhee (2)
  • Hyun-Jung Shin (2)
  • Man-Pyo Hong (2)

  1. Information Security Institute, Queensland University of Technology, Brisbane, Australia
  2. Ajou University, Suwon, South Korea
