BIT Numerical Mathematics, Volume 43, Issue 2, pp 427-448
Lower Dimensional Representation of Text Data Based on Centroids and Least Squares
 Haesun Park (Department of Computer Science and Engineering, University of Minnesota)
 Moongu Jeon (Department of Computer Science and Engineering, Univ. of California, Santa Barbara)
 J. Ben Rosen (Department of Computer Science and Engineering, University of Minnesota; University of California, San Diego)
Abstract
Dimension reduction in today's vector space based information retrieval systems is essential for improving computational efficiency in handling massive amounts of data. A mathematical framework for lower dimensional representation of text data in vector space based information retrieval is proposed using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then two new methods for dimension reduction based on the centroids of data clusters are proposed and shown to be more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties, are discussed.
Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
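The abstract names the ingredients (cluster centroids and least squares) but not the algorithm itself, so the following is only a minimal sketch under stated assumptions. It assumes that each document is a column of a term-document matrix A, that the cluster labels are known in advance, and that the reduced representation Y is the least squares solution of min ||C Y - A||_F, where the columns of C are the cluster centroids. The function names `centroid_matrix` and `reduce_least_squares`, the toy data, and the largest-coefficient classification rule are hypothetical illustrations, not taken from the paper.

```python
import numpy as np


def centroid_matrix(A, labels, k):
    """Return the m x k matrix whose j-th column is the mean of the
    columns of A that carry cluster label j."""
    m = A.shape[0]
    C = np.zeros((m, k))
    for j in range(k):
        C[:, j] = A[:, labels == j].mean(axis=1)
    return C


def reduce_least_squares(C, X):
    """Solve min_Y ||C Y - X||_F. Each column of Y is a k-dimensional
    representation of the corresponding column of X."""
    Y, *_ = np.linalg.lstsq(C, X, rcond=None)
    return Y


# Toy term-document matrix (assumed data): 4 terms x 6 documents,
# with two clusters known a priori.
A = np.array([
    [2.0, 3.0, 2.5, 0.0, 0.1, 0.0],
    [1.0, 1.5, 1.0, 0.0, 0.0, 0.2],
    [0.0, 0.1, 0.0, 3.0, 2.0, 2.5],
    [0.1, 0.0, 0.0, 1.0, 1.5, 1.0],
])
labels = np.array([0, 0, 0, 1, 1, 1])

C = centroid_matrix(A, labels, k=2)   # 4 x 2 centroid matrix
Y = reduce_least_squares(C, A)        # 2 x 6 reduced representation

# Illustrative rule only: assign each document to the centroid with the
# largest least squares coefficient.
predicted = Y.argmax(axis=0)
```

In this sketch the reduced dimension equals the number of clusters, and each centroid itself maps to a unit coordinate vector in the reduced space, which is why the largest coefficient is a plausible (though simplistic) cluster indicator. A new document vector q can be reduced the same way by calling `reduce_least_squares(C, q)`.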
 Journal

 BIT Numerical Mathematics
 Volume 43, Issue 2, pp 427-448
 Cover Date
 2003-06
 DOI
 10.1023/A:1026039313770
 Print ISSN
 0006-3835
 Online ISSN
 1572-9125
 Publisher
 Kluwer Academic Publishers
 Keywords

 Dimension reduction
 centroids
 least squares
 rank reducing decomposition
 classification
 feature extraction
 Authors

 Haesun Park (1)
 Moongu Jeon (2)
 J. Ben Rosen (3) (4)
 Author Affiliations

 1. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, U.S.A.
 2. Department of Computer Science and Engineering, Univ. of California, Santa Barbara, Santa Barbara, CA, 93106, U.S.A.
 3. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, U.S.A.
 4. University of California, San Diego, La Jolla, CA, 92093, U.S.A.