Abstract
E-discovery (electronic discovery) is the process of identifying, collecting, reviewing, and producing Electronically Stored Information (ESI) during the pre-trial procedure of a prosecution or legal investigation, and it is used in many countries. Its main challenges are the cost and time consumed by the information retrieval process. Natural language processing (NLP) is a key component in addressing these challenges, and we show with a case study that the approach scales well. As litigation costs rise, legal professionals have turned to fast information retrieval and machine learning methods to reduce manual labor and increase accuracy. In this paper, we use NLP to represent documents in a topic space with Latent Dirichlet Allocation (LDA), and we solve the information retrieval problem by finding document similarities in that topic space rather than in the corpus vocabulary space. We also apply the TF-IDF method within LDA to improve its performance. We report the results of our experiments on the ENRON dataset in the electronic discovery domain.
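The idea of comparing documents in topic space rather than vocabulary space can be illustrated with a minimal sketch. It is not the authors' implementation: the collapsed Gibbs sampler, the hyperparameters `alpha` and `beta`, the function names, and the four toy documents are all assumptions made for illustration, and the TF-IDF step mentioned in the abstract is omitted because the abstract does not say how it is combined with LDA.

```python
import math
import random


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, not the paper's code).

    Returns one smoothed topic distribution per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions.
    return [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_topic_similarity(theta, query_index):
    """Return (doc index, similarity) pairs, most similar first, excluding the query."""
    scores = [
        (j, cosine(theta[query_index], t))
        for j, t in enumerate(theta)
        if j != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus standing in for tokenized e-mails.
docs = [
    "gas pipeline contract price gas trading".split(),
    "trading gas price contract pipeline".split(),
    "meeting schedule lunch friday meeting".split(),
    "lunch friday schedule meeting calendar".split(),
]
theta = fit_lda(docs, num_topics=2)
ranking = rank_by_topic_similarity(theta, query_index=0)
```

Each document is reduced to a short vector of topic proportions, so similarity is computed over `num_topics` dimensions instead of the full vocabulary. On a real corpus, a tested library implementation of LDA would normally replace the hand-written sampler.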
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Celebi, N., Shashidhar, N. (2022). Topic Modeling in the ENRON Dataset. In: Hu, B., Xia, Y., Zhang, Y., Zhang, LJ. (eds) Big Data – BigData 2022. BigData 2022. Lecture Notes in Computer Science, vol 13730. Springer, Cham. https://doi.org/10.1007/978-3-031-23501-6_4
DOI: https://doi.org/10.1007/978-3-031-23501-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23500-9
Online ISBN: 978-3-031-23501-6
eBook Packages: Computer Science, Computer Science (R0)