BIT Numerical Mathematics, Volume 43, Issue 2, pp 427–448

Lower Dimensional Representation of Text Data Based on Centroids and Least Squares

  • Haesun Park
  • Moongu Jeon
  • J. Ben Rosen

Abstract

Dimension reduction in today's vector space based information retrieval systems is essential for improving computational efficiency in handling massive amounts of data. A mathematical framework for lower dimensional representation of text data in vector space based information retrieval is proposed using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on the Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then two new methods for dimension reduction based on the centroids of data clusters are proposed and shown to be more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods in terms of computational efficiency and data representation in the reduced space, as well as their mathematical properties, are discussed.

Experimental results are presented to illustrate the effectiveness of our methods on certain classification problems in a reduced dimensional space. The results indicate that for a successful lower dimensional representation of the data, it is important to incorporate a priori knowledge in the dimension reduction algorithms.
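
The abstract only names the ingredients of the centroid-based methods (cluster centroids and least squares), so the following is a minimal sketch of one plausible reading, not the authors' algorithm as published. It assumes documents are dense term vectors, that cluster labels are known in advance, that every cluster is non-empty, and that the centroids are linearly independent. The function names (`centroid`, `solve`, `reduce_by_centroids`) are hypothetical, and the normal equations are used only for brevity; a QR-based least squares solve would be the more stable choice in practice.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def solve(G, b):
    """Solve the small square system G x = b by Gaussian elimination
    with partial pivoting (assumes G is nonsingular)."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(G)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        tail = sum(M[i][c] * x[c] for c in range(i + 1, k))
        x[i] = (M[i][k] - tail) / M[i][i]
    return x


def reduce_by_centroids(docs, labels, k):
    """Map each document d to the k coefficients y minimizing ||C y - d||_2,
    where the columns of C are the k cluster centroids (assumed setup)."""
    cents = [centroid([d for d, lab in zip(docs, labels) if lab == c])
             for c in range(k)]
    gram = [[dot(ci, cj) for cj in cents] for ci in cents]  # C^T C
    return [solve(gram, [dot(c, d) for c in cents]) for d in docs]


# Toy data (made up for illustration): 4 documents over 4 terms, 2 clusters.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 3], [0, 0, 3, 1]]
labels = [0, 0, 1, 1]
reduced = reduce_by_centroids(docs, labels, 2)  # one 2-vector per document
```

In this toy case each 4-dimensional document is represented by 2 coefficients, one per cluster, so the reduced dimension equals the number of clusters rather than a rank chosen from a singular value spectrum.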

Keywords: dimension reduction, centroids, least squares, rank reducing decomposition, classification, feature extraction

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Haesun Park (1)
  • Moongu Jeon (2)
  • J. Ben Rosen (3, 4)

  1. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, U.S.A.
  2. Department of Computer Science and Engineering, Univ. of California, Santa Barbara, Santa Barbara, U.S.A.
  3. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, U.S.A.
  4. University of California, San Diego, La Jolla, U.S.A.
