Abstract
The large volume of data that forensic investigators must process and analyze is a growing challenge. Using hashsets of known files to identify and filter irrelevant files in forensic investigations is not as effective as it could be, especially in non-English-speaking countries. This paper describes the application of data mining techniques to identify irrelevant files from a sample of computers from a country or geographical region. The hashsets corresponding to these files are augmented with an optimized subset of effective hash values chosen from a conventional hash database. Experiments using real evidence demonstrate that the resulting augmented hashset yields 30.69% better filtering results than a conventional hashset, although it has approximately half as many (51.83%) hash values.
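The basic idea of hashset filtering mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the SHA-1 digest, the function names (`file_digest`, `filter_known`, `filtering_rate`) and the in-memory toy files are all assumptions made for illustration. It only shows how files whose hash appears in a set of known hashes are set aside, and how a simple filtering rate might be computed.

```python
from __future__ import annotations

import hashlib


def file_digest(content: bytes) -> str:
    """Return the hex SHA-1 digest of a file's content (SHA-1 is an assumed choice)."""
    return hashlib.sha1(content).hexdigest()


def filter_known(files: dict[str, bytes], hashset: set[str]) -> tuple[list[str], list[str]]:
    """Split file names into (filtered, remaining) by hashset membership."""
    filtered: list[str] = []
    remaining: list[str] = []
    for name, content in files.items():
        if file_digest(content) in hashset:
            filtered.append(name)   # known file: irrelevant to the investigation
        else:
            remaining.append(name)  # unknown file: still needs examination
    return filtered, remaining


def filtering_rate(filtered: list[str], remaining: list[str]) -> float:
    """Fraction of all files removed by the hashset (0.0 when there are no files)."""
    total = len(filtered) + len(remaining)
    return len(filtered) / total if total else 0.0


# Toy evidence: two files whose hashes are "known", one user document.
evidence = {
    "a.dll": b"system library",
    "b.doc": b"user document",
    "c.exe": b"system binary",
}
known_hashes = {file_digest(b"system library"), file_digest(b"system binary")}

filtered, remaining = filter_known(evidence, known_hashes)
print(filtered, remaining, filtering_rate(filtered, remaining))
```

Under these assumptions, a smaller hashset can still filter more files if its hash values match the files actually found on computers in the region of interest, which is the effect the abstract reports.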
Copyright information
© 2012 IFIP International Federation for Information Processing
About this paper
Cite this paper
Ruback, M., Hoelz, B., Ralha, C. (2012). A New Approach for Creating Forensic Hashsets. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics VIII. DigitalForensics 2012. IFIP Advances in Information and Communication Technology, vol 383. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33962-2_6
DOI: https://doi.org/10.1007/978-3-642-33962-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33961-5
Online ISBN: 978-3-642-33962-2
eBook Packages: Computer Science (R0)