Abstract
For digital forensics, eliminating the uninteresting is often more critical than finding the interesting. We define “uninteresting” as containing no useful information about the users of a drive, a definition that applies to most criminal investigations. Matching file hash values against published hash sets is the standard method, but these sets have limited coverage. This work compared nine automated methods of finding additional uninteresting files: (1) frequent hash values, (2) frequent paths, (3) frequent filename-directory pairs, (4) unusually busy times for a drive, (5) unusually busy weeks for a corpus, (6) unusually frequent file sizes, (7) membership in directories containing mostly-known files, (8) known uninteresting directories, and (9) uninteresting extensions. Tests were run on an international corpus of 83.8 million files. After removing the 25.1 % of files with hash values in the National Software Reference Library, an additional 54.7 % that matched two of our nine criteria were eliminated; few of their hash values were in two commercial hash sets. False negatives were estimated at 0.1 % and false positives at 19.0 %. We confirmed the generality of our methods by showing a good correlation between results obtained separately on two halves of our corpus. This work provides two kinds of results: 8.4 million hash values of uninteresting files in our own corpus, and programs for finding uninteresting files on new corpora.
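To make the idea of combining criteria concrete, the following is a minimal sketch of two of the nine methods, frequent hash values (1) and uninteresting extensions (9), with a file flagged only when it matches at least two criteria. It is not the programs described in the abstract. The record layout, the function names, the drive-count threshold, and the extension list are all assumptions chosen for illustration, and reading "matched two of our nine criteria" as "at least two" is also an assumption.

```python
import os
from collections import defaultdict

# Hypothetical record layout: (drive_id, path, hash_value). Not the paper's schema.
# Illustrative values only; the abstract gives no thresholds or extension list.
MIN_DRIVES = 2
UNINTERESTING_EXTENSIONS = {".dll", ".mui", ".ttf"}


def frequent_hashes(records, min_drives=MIN_DRIVES):
    """Return hash values that occur on at least `min_drives` distinct drives."""
    drives_by_hash = defaultdict(set)
    for drive_id, _path, digest in records:
        drives_by_hash[digest].add(drive_id)
    return {h for h, drives in drives_by_hash.items() if len(drives) >= min_drives}


def criteria_matched(record, frequent):
    """Return the names of the (two sketched) criteria that a record satisfies."""
    _drive_id, path, digest = record
    matched = set()
    if digest in frequent:
        matched.add("frequent_hash")
    if os.path.splitext(path)[1].lower() in UNINTERESTING_EXTENSIONS:
        matched.add("uninteresting_extension")
    return matched


def uninteresting(records, required=2):
    """Keep records matching at least `required` criteria (assumed reading)."""
    frequent = frequent_hashes(records)
    return [r for r in records if len(criteria_matched(r, frequent)) >= required]


if __name__ == "__main__":
    sample = [
        ("drive1", "WINDOWS/system32/a.dll", "h1"),
        ("drive2", "WINDOWS/system32/a.dll", "h1"),
        ("drive1", "Documents/letter.doc", "h2"),
        ("drive2", "Temp/x.dll", "h3"),
    ]
    for record in uninteresting(sample):
        print(record)
```

Requiring agreement between two independent criteria is the design point here: a file with a common extension but a hash seen on only one drive is left for the investigator rather than discarded.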
Acknowledgements
Riqui Schwamm assisted with the experiments, and Simson Garfinkel provided the corpus. The views expressed are those of the author and do not represent those of the U.S. Government.
Copyright information
© 2014 Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Rowe, N.C. (2014). Identifying Forensically Uninteresting Files Using a Large Corpus. In: Gladyshev, P., Marrington, A., Baggili, I. (eds) Digital Forensics and Cyber Crime. ICDF2C 2013. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 132. Springer, Cham. https://doi.org/10.1007/978-3-319-14289-0_7
DOI: https://doi.org/10.1007/978-3-319-14289-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14288-3
Online ISBN: 978-3-319-14289-0
eBook Packages: Computer Science, Computer Science (R0)