In today's vector space information retrieval systems, dimension reduction is imperative for efficiently manipulating the massive quantity of data. To be useful, this lower dimensional representation must be a good approximation of the original document set given in its full space. Toward that end, we present mathematical models, based on optimization and a general matrix rank reduction formula, which incorporate a priori knowledge of the existing structure. From these models, we develop new methods for dimension reduction that can be applied regardless of the relative dimensions of the term-document matrix. We illustrate the effectiveness of each method with document classification results from the reduced representation. After establishing relationships among the solutions obtained by the various methods, we conclude with a discussion of their relative accuracy and complexity.
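To make the general setting concrete, the sketch below (not the chapter's specific methods) shows the standard baseline it builds on: reduce a term-document matrix to rank k with a truncated SVD, then classify documents by nearest class centroid in the reduced space. The toy matrix, the choice of SVD as the reduction step, and the nearest-centroid classifier are illustrative assumptions only, using NumPy.

```python
import numpy as np

# Toy term-document matrix A (terms x documents): documents 0-2 use
# terms 0-2, documents 3-5 use terms 3-5, giving two clear clusters.
A = np.array([
    [2, 1, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 2],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

# Rank-k truncated SVD: A ~ U_k diag(s_k) V_k^T.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]

# Reduced representation: project each document (column of A)
# onto the k leading left singular vectors.
Y = Uk.T @ A  # k x n matrix of reduced documents

# Class centroids in the reduced space.
centroids = np.stack([Y[:, labels == c].mean(axis=1) for c in (0, 1)], axis=1)

def classify(doc):
    """Classify a term-frequency vector by nearest centroid in reduced space."""
    y = Uk.T @ doc
    dists = np.linalg.norm(centroids - y[:, None], axis=0)
    return int(np.argmin(dists))

query = np.array([1, 1, 1, 0, 0, 0], dtype=float)  # uses cluster-0 terms
print(classify(query))  # -> 0
```

The chapter's contribution is to replace the unsupervised reduction step with reductions that exploit a priori cluster structure; the classification stage in the reduced space proceeds the same way.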
© 2008 Springer-Verlag London Limited
Howland, P., Park, H. (2008). Cluster-Preserving Dimension Reduction Methods for Document Classification. In: Berry, M.W., Castellanos, M. (eds) Survey of Text Mining II. Springer, London. https://doi.org/10.1007/978-1-84800-046-9_1
Print ISBN: 978-1-84800-045-2
Online ISBN: 978-1-84800-046-9