Discussion Tracking in Enron Email Using PARAFAC

  • Brett W. Bader
  • Michael W. Berry
  • Murray Browne
Chapter

In this chapter, we apply a nonnegative tensor factorization algorithm to extract and detect meaningful discussions from electronic mail messages for a period of one year. For the publicly released Enron electronic mail collection, we encode a sparse term-author-month array for subsequent three-way factorization using the PARAllel FACtors (or PARAFAC) decomposition first proposed by Harshman. Using nonnegative tensors, we preserve natural data nonnegativity and avoid the subtractive basis vector and encoding interactions present in techniques such as principal component analysis. Results in thread detection and interpretation are discussed in the context of published Enron business practices and activities, and benchmarks addressing the computational complexity of our approach are provided. The resulting tensor factorizations can be rendered as Gantt-like charts for assessing the duration, order, and dependencies of focused discussions over time.
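The abstract does not spell out the update rule, so the following is only a minimal sketch of a nonnegative three-way PARAFAC fit, assuming Lee-Seung-style multiplicative updates and a small dense NumPy array in place of the sparse term-author-month array. The function names (`khatri_rao`, `unfold`, `nn_parafac`) and the toy data are hypothetical and chosen for illustration; they are not taken from the chapter or from the MATLAB Tensor Toolbox.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: row (j*K + k) holds B[j, :] * C[k, :]."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def unfold(X, mode):
    """Matricize X along `mode`; the other two axes keep their order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def nn_parafac(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] with A, B, C >= 0.

    Assumption: multiplicative updates, which keep every factor entry
    nonnegative as long as the data and the initial guess are nonnegative.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            Z = khatri_rao(others[0], others[1])   # matches unfold() ordering
            numer = unfold(X, mode) @ Z
            denom = factors[mode] @ (Z.T @ Z) + eps  # eps avoids division by zero
            factors[mode] *= numer / denom
    return factors

# Hypothetical toy data: 6 terms x 4 authors x 5 months, built from two
# planted "discussions" with different active months.
rng = np.random.default_rng(1)
A_true = rng.random((6, 2))
B_true = rng.random((4, 2))
C_true = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
X = np.einsum("ir,jr,kr->ijk", A_true, B_true, C_true)

terms, authors, months = nn_parafac(X, rank=2)
X_hat = np.einsum("ir,jr,kr->ijk", terms, authors, months)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print("relative reconstruction error:", rel_err)
```

In this sketch each column of `months` is the time profile of one extracted component; plotting the months where a column is large, one row per component, would give the kind of Gantt-like view the abstract describes. A realistic term-author-month array would need a sparse representation rather than the dense array used here.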

Keywords

Nonnegative Matrix Factorization; Alternate Little Square; Tensor Decomposition; Federal Energy Regulatory Commission; Anchor Text
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. E. Acar, S.A. Çamtepe, M.S. Krishnamoorthy, and B. Yener. Modeling and multiway analysis of chatroom tensors. In ISI 2005: IEEE International Conference on Intelligence and Security Informatics, volume 3495 of Lecture Notes in Computer Science, pages 256-268. Springer, New York, 2005.
  2. M.W. Berry and M. Browne. Email surveillance using non-negative matrix factorization. In Workshop on Link Analysis, Counterterrorism and Security, SIAM Conf. on Data Mining, Newport Beach, CA, 2005.
  3. M.W. Berry and M. Browne. Email surveillance using nonnegative matrix factorization. Computational & Mathematical Organization Theory, 11:249-264, 2005.
  4. M.W. Berry and M. Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, Philadelphia, second edition, 2005.
  5. M.W. Berry, M. Browne, A.N. Langville, V.P. Pauca, and R.J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155-173, 2007.
  6. B.W. Bader and T.G. Kolda. Efficient MATLAB computations with sparse and factored tensors. Technical Report SAND2006-7592, Sandia National Laboratories, Albuquerque, New Mexico and Livermore, California, December 2006. Available from World Wide Web: http://csmr.ca.sandia.gov/ tgkolda/ pubs.html#SAND2006-7592.
  7. B.W. Bader and T.G. Kolda. MATLAB Tensor Toolbox, version 2.1. http://csmr.ca.sandia.gov/tgkolda/TensorToolbox/, December 2006.
  8. J.D. Carroll and J.J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of ‘Eckart-Young’ decomposition. Psychometrika, 35:283-319, 1970.
  9. W.W. Cohen. Enron email dataset. Web page. http://www.cs.cmu.edu/~enron/.
  10. N.M. Faber, R. Bro, and P.K. Hopke. Recent developments in CANDECOMP/PARAFAC algorithms: a critical review. Chemometr. Intell. Lab. Syst., 65(1):119-137, January 2003.
  11. Federal Energy Regulatory Commission. FERC: Information released in Enron investigation. http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.
  12. T. Grieve. The decline and fall of the Enron empire. Salon, October 14, 2003. Available from World Wide Web: http://www.salon.com/news/feature/2003/10/14/enron/index_np.html.
  13. J.T. Giles, L. Wo, and M.W. Berry. GTP (General Text Parser) Software for Text Mining. In H. Bozdogan, editor, Software for Text Mining, in Statistical Data Mining and Knowledge Discovery, pages 455-471. CRC Press, Boca Raton, FL, 2003.
  14. R.A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1-84, 1970. Available at http://publish.uwo.ca/~harshman/wpppfac0.pdf.
  15. T.G. Kolda and B.W. Bader. The TOPHITS model for higher-order web link analysis. In Workshop on Link Analysis, Counterterrorism and Security, 2006. Available from World Wide Web: http://www.cs.rit.edu/~amt/linkanalysis06/accepted/21.pdf.
  16. T.G. Kolda, B.W. Bader, and J.P. Kenny. Higher-order web link analysis using multilinear algebra. In ICDM 2005: Proceedings of the 5th IEEE International Conference on Data Mining, pages 242-249. IEEE Computer Society, Los Alamitos, CA, 2005.
  17. D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
  18. B. McLean and P. Elkind. The Smartest Guys in the Room: The Amazing Rise and Scandalous Fall of Enron. Portfolio, New York, 2003.
  19. M. Mørup, L.K. Hansen, and S.M. Arnfred. Sparse higher order non-negative matrix factorization. Neural Computation, 2006. Submitted.
  20. M. Mørup. Decomposing event related EEG using parallel factor (PARAFAC). Presentation, August 29, 2005. Workshop on Tensor Decompositions and Applications, CIRM, Luminy, Marseille, France.
  21. C.E. Priebe, J.M. Conroy, D.J. Marchette, and Y. Park. Enron dataset. Web page, February 2006. http://cis.jhu.edu/~parky/Enron/enron.html.
  22. J. Shetty and J. Adibi. Ex employee status report. Online, 2005. http://www.isi.edu/~adibi/Enron/Enron Employee Status.xls.
  23. A. Smilde, R. Bro, and P. Geladi. Multi-Way Analysis: Applications in the Chemical Sciences. Wiley, West Sussex, England, 2004. Available from World Wide Web: http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471986917.html.
  24. F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons. Document clustering using non-negative matrix factorization. Information Processing & Management, 42(2):373-386, 2006.
  25. N.D. Sidiropoulos, G.B. Giannakis, and R. Bro. Blind PARAFAC receivers for DS-CDMA systems. IEEE Transactions on Signal Processing, 48(3):810-823, 2000.
  26. J.-T. Sun, H.-J. Zeng, H. Liu, Y. Lu, and Z. Chen. CubeSVD: a novel approach to personalized Web search. In WWW 2005: Proceedings of the 14th International Conference on World Wide Web, pages 382-390. ACM Press, New York, 2005.
  27. G. Tomasi and R. Bro. PARAFAC and missing values. Chemometr. Intell. Lab. Syst., 75(2):163-180, February 2005.
  28. L.R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279-311, 1966.

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Brett W. Bader (1)
  • Michael W. Berry (2)
  • Murray Browne (2)
  1. Applied Computational Methods Department, Sandia National Laboratories, Albuquerque
  2. Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville