
Cluster-Preserving Dimension Reduction Methods for Efficient Classification of Text Data


Abstract

In today’s vector space information retrieval systems, dimension reduction is imperative for efficiently manipulating the massive quantity of data. To be useful, this lower-dimensional representation must be a good approximation of the original document set given in its full space. Toward that end, we present mathematical models, based on optimization and a general matrix rank reduction formula, which incorporate a priori knowledge of the existing structure. From these models, we develop new methods for dimension reduction based on the centroids of data clusters. We also adapt and extend the discriminant analysis projection, which is well known in pattern recognition. The result is a generalization of discriminant analysis that can be applied regardless of the relative dimensions of the term-document matrix.
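The abstract does not spell out the algorithms, so the following is only a minimal sketch of one common centroid-based formulation, not necessarily the chapter's exact method. It assumes a term-document matrix `A` (terms by documents), a known cluster label per document, and a least-squares step that represents each document by its coefficients with respect to the cluster centroids. The function name `centroid_reduce` and the toy data are hypothetical.

```python
import numpy as np

def centroid_reduce(A, labels):
    """Map each column of A (terms x documents) to k coordinates,
    where k is the number of distinct cluster labels.

    Assumed formulation: build the centroid matrix C (terms x k) and
    solve min_Y ||C @ Y - A||_F, so Y (k x documents) is the reduced
    representation.
    """
    classes = np.unique(labels)
    # Column j of C is the mean of the documents assigned to cluster j.
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in classes])
    # Least-squares solve for all documents at once.
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y

# Toy term-document matrix: 5 terms, 4 documents, 2 clusters (made-up values).
A = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.8, 1.0, 0.1, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.1, 0.0, 0.9, 1.0],
              [0.5, 0.5, 0.5, 0.5]])
labels = np.array([0, 0, 1, 1])

C, Y = centroid_reduce(A, labels)
print(Y.shape)  # one k-dimensional column per document
```

In this sketch the reduced dimension equals the number of clusters, so each document's largest coordinate tends to indicate the centroid it lies closest to, which is the sense in which the cluster structure is meant to be preserved.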

Keywords

  • Discriminant Analysis
  • Singular Value Decomposition
  • Dimension Reduction
  • Singular Vector
  • Misclassification Rate

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© 2004 Springer Science+Business Media New York

About this chapter

Cite this chapter

Howland, P., Park, H. (2004). Cluster-Preserving Dimension Reduction Methods for Efficient Classification of Text Data. In: Berry, M.W. (eds) Survey of Text Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-4305-0_1


  • DOI: https://doi.org/10.1007/978-1-4757-4305-0_1

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4419-3057-6

  • Online ISBN: 978-1-4757-4305-0

  • eBook Packages: Springer Book Archive