Lower Dimensional Representation of Text Data Based on Centroids and Least Squares

BIT Numerical Mathematics

Abstract

Dimension reduction is essential in today's vector space based information retrieval systems for improving computational efficiency when handling massive amounts of data. We propose a mathematical framework for lower dimensional representation of text data in vector space based information retrieval, using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a dimension reduction method from this framework. We then propose two new dimension reduction methods based on the centroids of data clusters and show that they are more efficient and effective than LSI/SVD when a priori information on the cluster structure of the data is available. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties, are discussed.

Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
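
As a rough illustration of the centroid and least squares idea named in the title, the following Python sketch reduces a term-document matrix to as many dimensions as there are clusters. The abstract gives no formulas, so the details here are assumptions: the columns of `C` are taken to be cluster centroids, and the reduced representation `Y` is taken to minimize the Frobenius norm of `C @ Y - A`. The function name `centroid_reduce` and the synthetic data are hypothetical and are not taken from the article.

```python
import numpy as np


def centroid_reduce(A, labels):
    """Hypothetical helper: reduce an m x n term-document matrix A to k rows,
    where k is the number of distinct cluster labels (assumed known a priori)."""
    classes = np.unique(labels)
    # Column j of C is the centroid (mean document vector) of cluster j.
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in classes])
    # Least squares: Y minimizes ||C @ Y - A||_F, one column per document.
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y


# Synthetic data (an assumption for illustration): 3 clusters of 20 documents,
# each document a noisy copy of its cluster prototype over 50 terms.
rng = np.random.default_rng(0)
m, n_per, k = 50, 20, 3
prototypes = rng.random((m, k))
A = np.hstack([
    prototypes[:, [j]] + 0.05 * rng.standard_normal((m, n_per))
    for j in range(k)
])
labels = np.repeat(np.arange(k), n_per)

C, Y = centroid_reduce(A, labels)
print(C.shape, Y.shape)  # centroid matrix is m x k, reduced data is k x n

# A new document q is mapped into the same reduced space by the same solve.
q = prototypes[:, 1] + 0.05 * rng.standard_normal(m)
y_q, *_ = np.linalg.lstsq(C, q, rcond=None)
```

In this sketch the reduced dimension equals the number of clusters, so the cost of the reduction is dominated by one least squares solve against a thin `m x k` matrix rather than by a singular value decomposition of `A`.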

Cite this article

Park, H., Jeon, M. & Rosen, J.B. Lower Dimensional Representation of Text Data Based on Centroids and Least Squares. BIT Numerical Mathematics 43, 427–448 (2003). https://doi.org/10.1023/A:1026039313770

  • DOI: https://doi.org/10.1023/A:1026039313770
