Understanding and Enhancing the Folding-In Method in Latent Semantic Indexing

  • Xiang Wang
  • Xiaoming Jin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4080)


Latent Semantic Indexing (LSI) has proven effective at capturing the semantic structure of document collections and is widely used in content-based text retrieval. However, in many real-world applications that deal with very large document collections, LSI suffers from high computational complexity, which stems from the Singular Value Decomposition (SVD). As a result, the folding-in method is widely used in practice as an approximation to LSI. It is, however, generally implemented "as is", without detailed analysis of its effectiveness and performance; consequently, its performance cannot be guaranteed. In this paper, we first illustrate the underlying principle of the folding-in method from a linear algebra point of view and analyze several commonly used techniques. Based on this theoretical analysis, we propose a novel algorithm to guide the implementation of the folding-in method. Our method is justified and evaluated through a series of experiments on various classical IR data sets. The results indicate that our method is effective and performs consistently across different document collections.
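
For orientation, the folding-in projection discussed above can be sketched in its standard textbook form (as given, for example, by Berry, Dumais and O'Brien). This is not the algorithm proposed in this paper, and the symbols A, U_k, Σ_k, V_k, d and t are notation assumed here for illustration; they do not appear in the abstract.

```latex
% Rank-k truncated SVD of the m x n term-document matrix A
A \approx A_k = U_k \Sigma_k V_k^{T}

% Folding in a new document d (an m x 1 term vector):
% project it onto the existing k-dimensional basis
\hat{d} = \Sigma_k^{-1} U_k^{T} d

% Folding in a new term t (a 1 x n row of document weights)
\hat{t} = t \, V_k \, \Sigma_k^{-1}
```

Because U_k^T A = Σ_k V_k^T, applying the document formula to an existing column a_j of A returns exactly the j-th row of V_k. Folding-in therefore places new items in the existing space while leaving U_k and Σ_k unchanged, which is what makes it cheap and also why it is only an approximation to recomputing the SVD.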


Keywords: Singular Value Decomposition, Average Precision, Document Collection, Vector Space Model, Semantic Structure





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xiang Wang (1)
  • Xiaoming Jin (1)
  1. School of Software, Tsinghua University, Beijing, China
