Abstract
We investigate the structures present in the Enron email dataset using singular value decomposition and semidiscrete decomposition. Using word frequency profiles, we show that messages fall into two distinct groups, whose extrema are characterized by short messages and rare words versus long messages and common words. It is surprising that message length and word use pattern should be related in this way. We also investigate relationships among individuals based on their patterns of word use in email. We show that word use is correlated with function within the organization, as expected. Lastly, we show that relative changes in individuals' word usage over time can be used to identify key players in major company events.
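As a rough illustration of the kind of analysis the abstract describes, the sketch below applies singular value decomposition to a small message-by-word frequency matrix and projects each message onto the leading singular directions. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `svd_embed`, the toy matrix, and the choice of two dimensions are all hypothetical, and no normalization or word selection is applied. Semidiscrete decomposition is not shown.

```python
import numpy as np

def svd_embed(freq, k=2):
    """Project the rows of a message-by-word frequency matrix
    onto its top-k singular directions (hypothetical helper)."""
    freq = np.asarray(freq, dtype=float)
    # Thin SVD: freq = U @ diag(s) @ Vt, with s sorted in descending order
    U, s, Vt = np.linalg.svd(freq, full_matrices=False)
    k = min(k, len(s))
    # Coordinates of each message in the truncated k-dimensional space
    return U[:, :k] * s[:k], s

# Hypothetical toy data: 4 messages (rows) x 5 words (columns).
# Messages 0 and 1 share one vocabulary; messages 2 and 3 share another.
toy = np.array([
    [3, 2, 0, 0, 1],
    [4, 3, 0, 0, 0],
    [0, 0, 1, 2, 0],
    [0, 0, 2, 1, 0],
])

coords, singular_values = svd_embed(toy, k=2)
print(coords)           # one 2-D point per message
print(singular_values)  # relative weight of each dimension
```

In this toy case, messages with similar word use land close together in the projected space, which is the property that makes such plots useful for spotting groups of messages or individuals.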
Author information
Additional information
P.S. Keila is a graduate student in the School of Computing at Queen's University. His research area is data mining in text.
D.B. Skillicorn is a professor in the School of Computing at Queen's University, where he heads the Smart Information Management Laboratory. His research area is data mining using matrix decompositions, particularly applied to complex datasets in areas such as biomedicine, geochemistry, counterterrorism and fraud.
About this article
Cite this article
Keila, P.S., Skillicorn, D.B. Structure in the Enron Email Dataset. Comput Math Organiz Theor 11, 183–199 (2005). https://doi.org/10.1007/s10588-005-5379-y
DOI: https://doi.org/10.1007/s10588-005-5379-y