Lower Dimensional Representation of Text Data Based on Centroids and Least Squares
 Haesun Park,
 Moongu Jeon,
 J. Ben Rosen
Abstract
Dimension reduction is essential in today's vector-space-based information retrieval systems for improving computational efficiency when handling massive amounts of data. A mathematical framework for lower dimensional representation of text data in vector-space-based information retrieval is proposed, using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a dimension reduction method from this framework. Two new dimension reduction methods based on the centroids of data clusters are then proposed and shown to be more efficient and effective than LSI/SVD when a priori information on the cluster structure of the data is available. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties, are discussed.
Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
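The abstract does not spell out the two centroid-based methods, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a term-document matrix A (terms by documents) with known cluster labels, forms a centroid matrix C with one column per cluster, and then obtains a reduced representation either by solving the least squares problem min_Y ||C Y - A||_F or by projecting onto an orthonormal basis of the centroid space from a QR decomposition. The function names, the toy matrix, and the labels are all hypothetical.

```python
import numpy as np


def centroid_matrix(A, labels):
    """Return the m x k matrix whose columns are the cluster centroids of A.

    A is m x n (one column per document); labels has one entry per column.
    """
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    return np.column_stack([A[:, labels == c].mean(axis=1) for c in clusters])


def centroid_reduce(A, labels):
    """Assumed centroid variant: solve min_Y ||C Y - A||_F for the k x n matrix Y."""
    C = centroid_matrix(A, labels)
    # lstsq accepts a matrix right-hand side and solves for every column of A at once.
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y


def orthogonal_centroid_reduce(A, labels):
    """Assumed orthogonal variant: project A onto an orthonormal basis of range(C)."""
    C = centroid_matrix(A, labels)
    Q, _ = np.linalg.qr(C)  # reduced QR: Q is m x k with orthonormal columns
    return Q, Q.T @ A


# Hypothetical toy data: 5 terms, 4 documents, 2 known clusters.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 1.0, 1.0, 2.0],
    [1.0, 0.0, 0.0, 1.0],
])
labels = [0, 0, 1, 1]

C, Y = centroid_reduce(A, labels)
Q, Z = orthogonal_centroid_reduce(A, labels)
print(Y.shape, Z.shape)  # each document is now described by k = 2 coordinates
```

In this sketch the reduced dimension equals the number of clusters, so the cost is dominated by a least squares solve or a QR factorization of a matrix with only k columns, rather than a singular value decomposition of the full matrix.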
 Journal

BIT Numerical Mathematics
Volume 43, Issue 2, pp 427-448
 Cover Date
 2003-06-01
 DOI
 10.1023/A:1026039313770
 Print ISSN
 0006-3835
 Online ISSN
 1572-9125
 Publisher
 Kluwer Academic Publishers
 Keywords

 Dimension reduction
 centroids
 least squares
 rank reducing decomposition
 classification
 feature extraction
 Authors

 Haesun Park (1)
 Moongu Jeon (2)
 J. Ben Rosen (3) (4)
 Author Affiliations

 1. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, U.S.A.
 2. Department of Computer Science and Engineering, Univ. of California, Santa Barbara, Santa Barbara, CA, 93106, U.S.A.
 3. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, U.S.A.
 4. University of California, San Diego, La Jolla, CA, 92093, U.S.A.