
Cluster-Preserving Dimension Reduction Methods for Document Classification

Chapter in Survey of Text Mining II

In today's vector space information retrieval systems, dimension reduction is imperative for efficiently manipulating the massive quantity of data. To be useful, this lower dimensional representation must be a good approximation of the original document set given in its full space. Toward that end, we present mathematical models, based on optimization and a general matrix rank reduction formula, which incorporate a priori knowledge of the existing structure. From these models, we develop new methods for dimension reduction that can be applied regardless of the relative dimensions of the term-document matrix. We illustrate the effectiveness of each method with document classification results from the reduced representation. After establishing relationships among the solutions obtained by the various methods, we conclude with a discussion of their relative accuracy and complexity.
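To make the abstract's idea concrete: one simple cluster-preserving reduction discussed in this literature represents each class by its centroid and maps every document to its least-squares coordinates in the span of those centroids, so the reduced dimension equals the number of classes. The sketch below is illustrative only (function and variable names are hypothetical, not taken from the chapter), and it works regardless of the relative dimensions of the term-document matrix:

```python
import numpy as np

def centroid_reduction(A, labels):
    """Reduce a term-document matrix A (terms x documents) to k dimensions,
    where k is the number of classes, by least-squares projection onto the
    span of the class centroids. Returns (C, Y): the centroid matrix and
    the k-dimensional representation of the documents."""
    classes = np.unique(labels)
    # Centroid matrix C: column j is the mean document vector of class j.
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in classes])
    # Solve min_Y ||C @ Y - A||_F; Y holds the reduced coordinates.
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y

# Toy example: 5 terms, 6 documents, 2 classes.
rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((5, 6)))
labels = np.array([0, 0, 0, 1, 1, 1])
C, Y = centroid_reduction(A, labels)
print(Y.shape)  # each document is now a 2-dimensional vector
```

A new query vector q can be folded into the same space by solving the analogous least-squares problem against C, which is what makes the representation usable for classification in the reduced space.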





Copyright information

© 2008 Springer-Verlag London Limited

Cite this chapter

Howland, P., Park, H. (2008). Cluster-Preserving Dimension Reduction Methods for Document Classification. In: Berry, M.W., Castellanos, M. (eds) Survey of Text Mining II. Springer, London. https://doi.org/10.1007/978-1-84800-046-9_1


  • DOI: https://doi.org/10.1007/978-1-84800-046-9_1

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-045-2

  • Online ISBN: 978-1-84800-046-9

  • eBook Packages: Computer Science (R0)
