Abstract
Most actionable evidence is identified during the analysis phase of a digital forensic investigation. At present, this phase relies on expression-based searches, which assume that the investigator already has a good understanding of the evidence; latent evidence cannot be found with such methods. Knowledge discovery and data mining (KDD) techniques can significantly enhance the analysis process. One promising KDD technique is topic modeling, which infers the underlying semantic context of a text collection and summarizes it as topics, each described by a set of words. This paper investigates the application of topic modeling to forensic data and its ability to contribute to the analysis phase. It also highlights the challenges that forensic data poses to topic modeling algorithms and reports the lessons learned from a case study.
Copyright information
© 2008 IFIP International Federation for Information Processing
About this paper
Cite this paper
de Waal, A., Venter, J., Barnard, E. (2008). Applying Topic Modeling to Forensic Data. In: Ray, I., Shenoi, S. (eds) Advances in Digital Forensics IV. DigitalForensics 2008. IFIP — The International Federation for Information Processing, vol 285. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84927-0_10
DOI: https://doi.org/10.1007/978-0-387-84927-0_10
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-84926-3
Online ISBN: 978-0-387-84927-0
eBook Packages: Computer Science (R0)