Abstract
Dimension reduction is essential in today's vector-space-based information retrieval systems for improving computational efficiency when handling massive amounts of data. We propose a mathematical framework for lower dimensional representation of text data in vector-space-based information retrieval, using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a dimension reduction method from this framework. We then propose two new dimension reduction methods based on the centroids of data clusters, and show that they are more efficient and effective than LSI/SVD when a priori information on the cluster structure of the data is available. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties, are discussed.
Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
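To make the contrast between the two kinds of reduction concrete, the following is a minimal sketch, not the authors' implementation. It assumes, going by the title and abstract alone, that a centroid-based reduction represents each document by the least squares solution of min_Y ||C Y - A||_F, where the columns of C are the centroids of the known clusters, and it compares this with a rank-k truncated SVD. The function names (`centroid_matrix`, `centroid_reduce`, `svd_reduce`) and the toy term-document matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def centroid_matrix(A, labels, k):
    """Return the m x k matrix whose j-th column is the centroid of cluster j."""
    return np.column_stack([A[:, labels == j].mean(axis=1) for j in range(k)])

def centroid_reduce(A, labels, k):
    """Assumed formulation: solve min_Y ||C Y - A||_F, with C the centroid matrix.

    Each column of Y is the k-dimensional representation of one document.
    """
    C = centroid_matrix(A, labels, k)
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y

def svd_reduce(A, k):
    """Rank-k truncated SVD: A is approximated by U_k (Sigma_k V_k^T)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k, None] * Vt[:k, :]

# Hypothetical toy data: 6 terms (rows), 6 documents (columns), 2 known clusters.
A = np.array([
    [2, 1, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 1],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 1, 0, 1, 2, 2],
    [0, 0, 0, 1, 1, 2],
], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])  # a priori cluster membership
k = 2

C, Y = centroid_reduce(A, labels, k)   # centroid-based reduction
B, Z = svd_reduce(A, k)                # LSI/SVD-style reduction

err_centroid = np.linalg.norm(A - C @ Y)  # Frobenius norm of the residual
err_svd = np.linalg.norm(A - B @ Z)
print("reduced shape:", Y.shape)
print("centroid residual:", err_centroid)
print("SVD residual:", err_svd)
```

In this sketch the centroid-based reduction needs only cluster means and one small least squares solve, whereas the SVD route requires a full decomposition of the term-document matrix. By the Eckart-Young theorem the truncated SVD always attains the smaller approximation error, so any advantage of a centroid-based representation on classification tasks would have to come from its use of the cluster structure rather than from a better approximation of A.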
Cite this article
Park, H., Jeon, M. & Rosen, J.B. Lower Dimensional Representation of Text Data Based on Centroids and Least Squares. BIT Numerical Mathematics 43, 427–448 (2003). https://doi.org/10.1023/A:1026039313770