Lower Dimensional Representation of Text Data Based on Centroids and Least Squares
Dimension reduction in today's vector space based information retrieval systems is essential for improving computational efficiency in handling massive amounts of data. A mathematical framework for lower dimensional representation of text data in vector space based information retrieval is proposed using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then two new methods for dimension reduction based on the centroids of data clusters are proposed and shown to be more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties, are discussed.
Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
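The contrast between the two approaches named in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes that a centroid-based reduction takes the cluster centroids as the columns of a basis matrix `C` and obtains the reduced representation `Y` by solving the least squares problem min ||C Y - A||_F, and that LSI/SVD corresponds to a rank-k truncated SVD. The function names, the toy term-document matrix `A`, and the cluster `labels` are all hypothetical.

```python
import numpy as np


def centroid_reduce(A, labels):
    """Reduce the columns of A (terms x documents) using cluster centroids.

    Assumption (not taken from the abstract): the reduced representation Y
    solves min_Y ||C Y - A||_F, where column j of C is the centroid of
    cluster j.
    """
    classes = np.unique(labels)
    # One centroid (mean document vector) per cluster, stacked as columns.
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in classes])
    # Solve the least squares problem for all documents at once.
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y


def lsi_reduce(A, k):
    """Rank-k truncated SVD representation, in the spirit of LSI/SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Basis: leading k left singular vectors; coordinates: scaled rows of Vt.
    return U[:, :k], s[:k, None] * Vt[:k, :]


# Hypothetical toy data: 5 terms, 4 documents, 2 known clusters.
A = np.array([
    [2.0, 3.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 0.0],
])
labels = np.array([0, 0, 1, 1])

C, Y = centroid_reduce(A, labels)   # Y holds each document in 2 dimensions
B, Z = lsi_reduce(A, k=2)           # Z is the 2-dimensional SVD counterpart

err_centroid = np.linalg.norm(A - C @ Y)
err_svd = np.linalg.norm(A - B @ Z)
```

In this sketch the centroid route needs only column means and one small least squares solve, while the SVD route factors the full matrix. By the Eckart-Young theorem the truncated SVD always gives the smaller approximation error, so any advantage of a centroid basis has to come from using the cluster structure rather than from a better low rank fit.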
BIT Numerical Mathematics
Volume 43, Issue 2, pp 427-448
- Kluwer Academic Publishers
- Dimension reduction
- least squares
- rank reducing decomposition
- feature extraction
- Author Affiliations
- 1. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, U.S.A.
- 2. Department of Computer Science and Engineering, Univ. of California, Santa Barbara, Santa Barbara, CA, 93106, U.S.A.
- 3. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, U.S.A.
- 4. University of California, San Diego, La Jolla, CA, 92093, U.S.A.