Lower Dimensional Representation of Text Data Based on Centroids and Least Squares
 Haesun Park,
 Moongu Jeon,
 J. Ben Rosen
Abstract
Dimension reduction in vector space based information retrieval systems is essential for improving computational efficiency when handling massive amounts of data. A mathematical framework for lower dimensional representation of text data in vector space based information retrieval is proposed using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then two new methods for dimension reduction based on the centroids of data clusters are proposed and shown to be more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties, are discussed.
Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
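The abstract does not spell out the algorithms, so the following is only a minimal sketch of one way a centroid and least squares based reduction could look. It assumes that documents are the columns of a term-document matrix A, that cluster labels are known in advance, and that each document is represented by its least squares coefficients with respect to the cluster centroids. The function names (centroid_reduce, project), the toy matrix, and the labels are hypothetical and are not taken from the paper.

```python
import numpy as np


def centroid_reduce(A, labels):
    """Hypothetical helper: reduce an m x n term-document matrix to k x n.

    Assumption: column j of C is the mean of the documents in cluster j,
    and the reduced representation Y minimizes ||C Y - A||_F.
    """
    classes = np.unique(labels)
    # Build the m x k centroid matrix, one column per cluster.
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in classes])
    # Least squares fit of every document (column of A) against the centroids.
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y


def project(C, q):
    """Map a new m-dimensional vector q into the k-dimensional reduced space."""
    y, *_ = np.linalg.lstsq(C, q, rcond=None)
    return y


# Toy data (made up): 4 terms, 6 documents, 2 clusters known a priori.
A = np.array([
    [2.0, 3.0, 2.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 3.0, 2.0, 2.0],
    [0.0, 1.0, 0.0, 2.0, 3.0, 3.0],
])
labels = np.array([0, 0, 0, 1, 1, 1])

C, Y = centroid_reduce(A, labels)

# Classify each document by its largest coordinate in the reduced space.
predicted = Y.argmax(axis=0)
```

Under these assumptions, and when the centroids are linearly independent, each centroid maps to a unit coordinate vector in the reduced space, so the largest coordinate of a reduced document gives a simple cluster assignment. The reduced dimension equals the number of clusters, which is where the a priori cluster information enters.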
 Title
 Lower Dimensional Representation of Text Data Based on Centroids and Least Squares
 Journal
 BIT Numerical Mathematics
 Volume 43, Issue 2, pp 427-448
 Cover Date
 2003-06-01
 DOI
 10.1023/A:1026039313770
 Print ISSN
 0006-3835
 Online ISSN
 1572-9125
 Publisher
 Kluwer Academic Publishers
 Keywords

 Dimension reduction
 centroids
 least squares
 rank reducing decomposition
 classification
 feature extraction
 Authors

 Haesun Park ^{(1)}
 Moongu Jeon ^{(2)}
 J. Ben Rosen ^{(3)} ^{(4)}
 Author Affiliations

 1. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, U.S.A.
 2. Department of Computer Science and Engineering, Univ. of California, Santa Barbara, Santa Barbara, CA, 93106, U.S.A.
 3. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, U.S.A.
 4. University of California, San Diego, La Jolla, CA, 92093, U.S.A.