Identifying Forensically Uninteresting Files Using a Large Corpus

Rowe, Neil C.

doi:10.1007/978-3-319-14289-0_7

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 132)

Included in the following conference series:

International Conference on Digital Forensics and Cyber Crime

Abstract

For digital forensics, eliminating the uninteresting is often more critical than finding the interesting. We define “uninteresting” as containing no useful information about users of a drive, a definition which applies to most criminal investigations. Matching file hash values to those in published hash sets is the standard method, but these sets have limited coverage. This work compared nine automated methods of finding additional uninteresting files: (1) frequent hash values, (2) frequent paths, (3) frequent filename-directory pairs, (4) unusually busy times for a drive, (5) unusually busy weeks for a corpus, (6) unusually frequent file sizes, (7) membership in directories containing mostly-known files, (8) known uninteresting directories, and (9) uninteresting extensions. Tests were run on an international corpus of 83.8 million files, and after removing the 25.1 % of files with hash values in the National Software Reference Library, an additional 54.7 % were eliminated that matched two of our nine criteria, few of whose hash values were in two commercial hash sets. False negatives were estimated at 0.1 % and false positives at 19.0 %. We confirmed the generality of our methods by showing a good correlation between results obtained separately on two halves of our corpus. This work provides two kinds of results: 8.4 million hash values of uninteresting files in our own corpus, and programs for finding uninteresting files on new corpora.
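
The combination rule described in the abstract (first drop files whose hashes are in a published set, then eliminate files that match two of the criteria) can be sketched roughly as follows. This is a minimal illustration, not the author's program: only four of the nine criteria are shown, and `FileRecord`, `MIN_DRIVES`, `BORING_EXTENSIONS`, `BORING_DIRECTORIES`, and the reading of "two" as "at least two" are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical per-file metadata record; field names are assumptions."""
    drive: str
    path: str
    sha1: str


# Illustrative values only; none of these come from the paper.
MIN_DRIVES = 3
BORING_EXTENSIONS = {".dll", ".fon", ".cab"}
BORING_DIRECTORIES = ("windows/system32", "program files")


def drives_per_key(records, key):
    """Count the distinct drives on which each key (hash, path, ...) occurs."""
    seen = defaultdict(set)
    for record in records:
        seen[key(record)].add(record.drive)
    return {k: len(drives) for k, drives in seen.items()}


def criteria_matched(record, hash_drives, path_drives):
    """Return the names of the example criteria that this record satisfies."""
    path = record.path.lower()
    matched = set()
    if hash_drives[record.sha1] >= MIN_DRIVES:
        matched.add("frequent_hash")
    if path_drives[path] >= MIN_DRIVES:
        matched.add("frequent_path")
    if PurePosixPath(path).suffix in BORING_EXTENSIONS:
        matched.add("uninteresting_extension")
    if any(directory in path for directory in BORING_DIRECTORIES):
        matched.add("uninteresting_directory")
    return matched


def find_uninteresting(records, known_hashes, min_criteria=2):
    """Skip files already covered by a published hash set, then flag the
    remaining files that satisfy at least `min_criteria` example criteria."""
    hash_drives = drives_per_key(records, lambda r: r.sha1)
    path_drives = drives_per_key(records, lambda r: r.path.lower())
    remaining = [r for r in records if r.sha1 not in known_hashes]
    return [
        r for r in remaining
        if len(criteria_matched(r, hash_drives, path_drives)) >= min_criteria
    ]
```

Requiring agreement between independent criteria, rather than trusting any single one, is what keeps a lone signal such as a common extension from eliminating a file on its own; the threshold is exposed as `min_criteria` so the trade-off can be adjusted.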

Acknowledgements

Riqui Schwamm assisted with the experiments, and Simson Garfinkel provided the corpus. The views expressed are those of the author and do not represent those of the U.S. Government.

Author information

Authors and Affiliations

U.S. Naval Postgraduate School, CS/Rp, GE-328, 1411 Cunningham Road, Monterey, CA, 93943, USA
Neil C. Rowe

Corresponding author

Correspondence to Neil C. Rowe.

Editor information

Editors and Affiliations

School of Computer Science and Informatics, University College Dublin, Dublin, Ireland
Pavel Gladyshev
College of Technical Innovation, Zayed University, Dubai, United Arab Emirates
Andrew Marrington
Tagliatela College of Engineering, University of New Haven, West Haven, USA
Ibrahim Baggili

About this paper

Cite this paper

Rowe, N.C. (2014). Identifying Forensically Uninteresting Files Using a Large Corpus. In: Gladyshev, P., Marrington, A., Baggili, I. (eds) Digital Forensics and Cyber Crime. ICDF2C 2013. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 132. Springer, Cham. https://doi.org/10.1007/978-3-319-14289-0_7

DOI: https://doi.org/10.1007/978-3-319-14289-0_7
Published: 23 December 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14288-3
Online ISBN: 978-3-319-14289-0
eBook Packages: Computer Science, Computer Science (R0)
