Cluster-Preserving Dimension Reduction Methods for Document Classification

  • Peg Howland
  • Haesun Park

In today's vector space information retrieval systems, dimension reduction is imperative for efficiently manipulating the massive quantity of data. To be useful, this lower-dimensional representation must be a good approximation of the original document set given in its full space. Toward that end, we present mathematical models, based on optimization and a general matrix rank reduction formula, which incorporate a priori knowledge of the existing structure. From these models, we develop new methods for dimension reduction that can be applied regardless of the relative dimensions of the term-document matrix. We illustrate the effectiveness of each method with document classification results from the reduced representation. After establishing relationships among the solutions obtained by the various methods, we conclude with a discussion of their relative accuracy and complexity.
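To make the setting concrete, the following is a minimal sketch of the kind of dimension reduction the abstract describes, using a rank-k truncated SVD of a term-document matrix (the approach underlying latent semantic indexing). The matrix entries and the choice k = 2 are invented for illustration; this is not the authors' method, only the standard baseline it builds on.

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
# The counts are illustrative only.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
    [1, 0, 2, 1],
], dtype=float)

# Rank-k truncated SVD: A is approximated by U_k * S_k * V_k^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# Each document is now represented by k coordinates instead of 5 term weights.
docs_reduced = np.diag(s[:k]) @ Vt[:k, :]   # shape (k, n_docs)
print(docs_reduced.shape)
```

By the Eckart-Young theorem, this rank-k representation is the best approximation of A in the Frobenius norm; the chapter's methods instead choose the reduced space to preserve a priori cluster structure rather than to minimize approximation error.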


Keywords: Principal Component Analysis · Linear Discriminant Analysis · Dimension Reduction · Latent Semantic Indexing · Full Space
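One simple way to incorporate a priori cluster knowledge of the kind the abstract alludes to is to represent each document by its least-squares coordinates in the span of the class centroids, so that the reduced dimension equals the number of classes. The sketch below is a hedged illustration of that idea; the data, labels, and dimensions are invented, and the chapter's actual methods are more general.

```python
import numpy as np

# Documents as columns, with known class labels (a priori structure).
# Values are illustrative only.
A = np.array([
    [2.0, 1.8, 0.1, 0.0],
    [1.0, 1.2, 0.2, 0.1],
    [0.1, 0.0, 1.9, 2.1],
    [0.0, 0.2, 1.0, 0.9],
])
labels = [0, 0, 1, 1]

# Centroid matrix C: one column per class, the mean of that class's documents.
C = np.column_stack([
    A[:, [j for j, l in enumerate(labels) if l == c]].mean(axis=1)
    for c in sorted(set(labels))
])

# Reduce each document a to y solving min_y ||C y - a||_2:
# its coordinates in the centroid subspace.
Y, *_ = np.linalg.lstsq(C, A, rcond=None)
print(Y.shape)   # k classes instead of m terms
```

Documents from the same class land near each other in the reduced space, which is what makes such representations useful for the classification experiments the abstract mentions.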



Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Peg Howland, Department of Mathematics and Statistics, Utah State University, Logan
  • Haesun Park, Division of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta
