A New Approach to Compressed File Fragment Identification

  • Khoa Nguyen
  • Dat Tran
  • Wanli Ma
  • Dharmendra Sharma
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 369)


Identifying the underlying type of a file given only a file fragment is a major challenge in digital forensics. Many methods have been applied to file type identification; however, identification accuracy for most file types remains very low, especially for files with complex structures, because their contents are compound data built from different data types. In this paper, we propose a new approach that combines deflate-encoded data detection, entropy-based clustering, and machine learning techniques to identify deflate-encoded file fragments. Experiments on the popular compound file type showed high identification accuracy for the proposed method.
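The abstract does not spell out how the entropy of a fragment is measured, so the following is only a minimal sketch of the standard byte-level Shannon entropy, the quantity named in the keywords. The function name `shannon_entropy` and the choice of single bytes as symbols are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(fragment: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte.

    Hypothetical helper: the symbols are single bytes, so the result
    lies between 0.0 (one repeated byte) and 8.0 (all 256 byte values
    equally frequent).
    """
    if not fragment:
        return 0.0
    total = len(fragment)
    counts = Counter(fragment)  # frequency of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Compressed data such as deflate output tends to score close to the 8 bits-per-byte ceiling, which is why entropy alone rarely separates one compressed type from another and is typically combined with other features before a classifier such as an SVM is applied.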


File fragment classification, Compressed file fragment classification, SVM, Shannon entropy



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Khoa Nguyen (1, corresponding author)
  • Dat Tran (1)
  • Wanli Ma (1)
  • Dharmendra Sharma (1)

  1. Faculty of Education, Science, Technology and Mathematics, University of Canberra, Canberra, Australia
