Topic Modeling in the ENRON Dataset

Celebi, Naciye; Shashidhar, Narasimha

doi:10.1007/978-3-031-23501-6_4

Naciye Celebi¹ & Narasimha Shashidhar¹

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13730)

Included in the following conference series:

International Conference on Big Data


Abstract

E-discovery is the electronic form of discovery: identifying, collecting, reviewing, and producing Electronically Stored Information (ESI) for the pre-trial procedure in a prosecution or legal investigation, as practiced in many countries. Its main challenges are the cost and time consumed by the information retrieval process. Natural language processing (NLP) is a key component in addressing these challenges, and we show through our case study that it scales effortlessly. Because litigation costs are increasing, legal professionals have sought fast information retrieval and machine learning methods to reduce manual labor and increase accuracy. In this paper, we use NLP to represent documents in a topic space with Latent Dirichlet Allocation (LDA) and solve the information retrieval problem by finding document similarities in that topic space rather than in the corpus vocabulary space. We also use the TF-IDF method with LDA to improve its performance. We report the results of our experiments on the ENRON dataset in the electronic discovery domain.
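The idea of comparing documents in a topic space rather than in the vocabulary space can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes a collapsed Gibbs sampler for LDA, cosine similarity as the similarity measure, and a toy tokenized corpus, and it leaves out the TF-IDF step because the abstract does not say how TF-IDF is combined with LDA. The names `fit_lda`, `cosine`, and `rank_similar`, the hyperparameters `alpha` and `beta`, and the example tokens are all hypothetical.

```python
import math
import random


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic distribution per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # occurrences of word w in topic k
    topic_total = [0] * num_topics                              # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions: each row sums to 1.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return theta


def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(theta, query_index):
    """Indices of the other documents, most similar to the query first."""
    scores = [
        (cosine(theta[query_index], row), j)
        for j, row in enumerate(theta)
        if j != query_index
    ]
    return [j for _, j in sorted(scores, reverse=True)]


# Toy tokenized "emails" (hypothetical data, not taken from the ENRON corpus).
docs = [
    ["gas", "pipeline", "contract", "price", "gas"],
    ["pipeline", "gas", "delivery", "contract"],
    ["meeting", "schedule", "lunch", "friday"],
    ["lunch", "meeting", "friday", "call"],
]
theta = fit_lda(docs, num_topics=2)
ranking = rank_similar(theta, query_index=0)
```

Each document becomes a vector of length `num_topics`, so similarity is computed over a handful of topic proportions instead of the full vocabulary. On a corpus this small the sampled topics are not stable, so the ranking should be read as an illustration of the workflow rather than as a meaningful result.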



Author information

Authors and Affiliations

Department of Computer Science, Sam Houston State University, Huntsville, TX, USA
Naciye Celebi & Narasimha Shashidhar


Corresponding author

Correspondence to Naciye Celebi.

Editor information

Editors and Affiliations

Shenzhen Yihuo Technology Co., Ltd., Shenzhen, China
Bo Hu
Chongqing University, Chongqing, China
Yunni Xia
Anhui University, Hefei, China
Yiwen Zhang
Kingdee International Software Group Co., Ltd., Shenzhen, China
Liang-Jie Zhang


Cite this paper

Celebi, N., Shashidhar, N. (2022). Topic Modeling in the ENRON Dataset. In: Hu, B., Xia, Y., Zhang, Y., Zhang, LJ. (eds) Big Data – BigData 2022. BigData 2022. Lecture Notes in Computer Science, vol 13730. Springer, Cham. https://doi.org/10.1007/978-3-031-23501-6_4


DOI: https://doi.org/10.1007/978-3-031-23501-6_4
Published: 10 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23500-9
Online ISBN: 978-3-031-23501-6
eBook Packages: Computer Science, Computer Science (R0)
