Identifying Forensically Uninteresting Files Using a Large Corpus

  • Conference paper
Digital Forensics and Cyber Crime (ICDF2C 2013)

Abstract

In digital forensics, eliminating uninteresting files is often more critical than finding interesting ones. We define “uninteresting” files as those containing no useful information about the users of a drive, a definition that applies to most criminal investigations. Matching file hash values against published hash sets is the standard method, but these sets have limited coverage. This work compared nine automated methods of finding additional uninteresting files: (1) frequent hash values, (2) frequent paths, (3) frequent filename-directory pairs, (4) unusually busy times for a drive, (5) unusually busy weeks for a corpus, (6) unusually frequent file sizes, (7) membership in directories containing mostly-known files, (8) known uninteresting directories, and (9) uninteresting extensions. Tests were run on an international corpus of 83.8 million files. After removing the 25.1% of files with hash values in the National Software Reference Library, we eliminated an additional 54.7% that matched two of our nine criteria; few of these files' hash values appeared in two commercial hash sets. False negatives were estimated at 0.1% and false positives at 19.0%. We confirmed the generality of our methods by showing a good correlation between results obtained separately on two halves of our corpus. This work provides two kinds of results: 8.4 million hash values of uninteresting files in our own corpus, and programs for finding uninteresting files on new corpora.
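
The combination rule described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's program: it implements only three of the nine criteria (frequent hash values, known uninteresting directories, and uninteresting extensions), and the record type `FileRecord`, the function names, the thresholds `min_drives` and `min_votes`, and the reading of "matched two of our nine criteria" as "at least two" are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical per-file metadata record; field names are assumptions."""
    drive: str
    path: str
    sha1: str


def frequent_hashes(files, min_drives):
    """Return hash values that occur on at least min_drives distinct drives."""
    drives_by_hash = defaultdict(set)
    for f in files:
        drives_by_hash[f.sha1].add(f.drive)
    return {h for h, drives in drives_by_hash.items() if len(drives) >= min_drives}


def extension(path):
    """Lower-cased extension of the final path component, or '' if none."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


def uninteresting(files, known_hashes, boring_dirs, boring_exts,
                  min_drives=2, min_votes=2):
    """Flag files that match at least min_votes of three illustrative criteria.

    Files whose hash is already in known_hashes (standing in for a published
    hash set) are removed first, mirroring the order in the abstract.
    """
    remaining = [f for f in files if f.sha1 not in known_hashes]
    common = frequent_hashes(remaining, min_drives)
    flagged = []
    for f in remaining:
        norm = f.path.replace("\\", "/").lower()
        votes = sum([
            f.sha1 in common,                               # frequent hash value
            any(norm.startswith(d) for d in boring_dirs),   # uninteresting directory
            extension(f.path) in boring_exts,               # uninteresting extension
        ])
        if votes >= min_votes:
            flagged.append(f)
    return flagged
```

Requiring agreement between two independent criteria, rather than accepting any single match, is one plausible way to limit false positives from a weak criterion; the threshold values used here are placeholders, not figures from the paper.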

Acknowledgements

Riqui Schwamm assisted with the experiments, and Simson Garfinkel provided the corpus. The views expressed are those of the author and do not represent those of the U.S. Government.

Author information

Corresponding author

Correspondence to Neil C. Rowe.

Copyright information

© 2014 Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Rowe, N.C. (2014). Identifying Forensically Uninteresting Files Using a Large Corpus. In: Gladyshev, P., Marrington, A., Baggili, I. (eds) Digital Forensics and Cyber Crime. ICDF2C 2013. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 132. Springer, Cham. https://doi.org/10.1007/978-3-319-14289-0_7

  • DOI: https://doi.org/10.1007/978-3-319-14289-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14288-3

  • Online ISBN: 978-3-319-14289-0

  • eBook Packages: Computer Science, Computer Science (R0)
