A Novel Updating Scheme for Probabilistic Latent Semantic Indexing

  • Constantine Kotropoulos
  • Athanasios Papaioannou
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3955)

Abstract

Probabilistic Latent Semantic Indexing (PLSI) is a statistical technique for automatic document indexing. A novel method is proposed for updating PLSI when new documents arrive. The proposed method incrementally adds the words of any new document to the term-document matrix and derives the updating equations for the probability of terms given the class (i.e. latent) variables and the probability of documents given the latent variables. The performance of the proposed method is compared to that of the folding-in algorithm, which is an inexpensive but potentially inaccurate updating method. It is demonstrated that the proposed updating algorithm outperforms the folding-in method in terms of the mean squared error between the aforementioned probabilities as estimated by each updating method and as estimated by the original non-adaptive PLSI algorithm.
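To make the baseline concrete, the following is a minimal sketch of plain PLSI training by EM, the folding-in update, and a mean squared error helper. It does not implement the updating equations proposed in the paper, which are not given in the abstract. The function names (`plsi_em`, `fold_in`, `mse`), the toy count matrix, and the symmetric parameterization P(d,w) = Σ_z P(z) P(d|z) P(w|z) are assumptions made for illustration.

```python
import random

def normalize(v):
    """Scale non-negative numbers to sum to 1 (uniform if they sum to 0)."""
    s = sum(v)
    return [x / s for x in v] if s > 0 else [1.0 / len(v)] * len(v)

def plsi_em(counts, n_topics, n_iter=50, seed=0):
    """Fit symmetric PLSI by plain EM (assumed parameterization).

    counts[d][w] is the count of term w in document d.
    Returns (p_z, p_d_z, p_w_z) with p_d_z[z][d] and p_w_z[z][w].
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])
    p_z = [1.0 / n_topics] * n_topics
    p_d_z = [normalize([rng.random() + 0.1 for _ in range(n_docs)])
             for _ in range(n_topics)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_terms)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_d = [[0.0] * n_docs for _ in range(n_topics)]
        new_w = [[0.0] * n_terms for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_terms):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w)
                post = normalize([p_z[z] * p_d_z[z][d] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step
                for z in range(n_topics):
                    new_d[z][d] += n * post[z]
                    new_w[z][w] += n * post[z]
        # M-step: renormalize the expected counts
        p_z = normalize([sum(row) for row in new_w])
        p_d_z = [normalize(row) for row in new_d]
        p_w_z = [normalize(row) for row in new_w]
    return p_z, p_d_z, p_w_z

def fold_in(doc_counts, p_w_z, n_iter=50):
    """Folding-in: estimate P(z | d_new) while P(w|z) stays fixed."""
    n_topics = len(p_w_z)
    p_z_d = [1.0 / n_topics] * n_topics
    for _ in range(n_iter):
        acc = [0.0] * n_topics
        for w, n in enumerate(doc_counts):
            if n == 0:
                continue
            post = normalize([p_z_d[z] * p_w_z[z][w] for z in range(n_topics)])
            for z in range(n_topics):
                acc[z] += n * post[z]
        p_z_d = normalize(acc)
    return p_z_d

def mse(a, b):
    """Mean squared error between two equally shaped probability tables."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

# Toy data (hypothetical): 4 training documents over 6 terms, 1 new document.
train = [[3, 2, 1, 0, 0, 0],
         [2, 3, 1, 0, 0, 0],
         [0, 0, 0, 2, 3, 1],
         [0, 0, 1, 1, 2, 3]]
new_doc = [1, 2, 0, 0, 0, 0]

p_z, p_d_z, p_w_z = plsi_em(train, n_topics=2)
print("folded-in P(z|d_new):", fold_in(new_doc, p_w_z))

# Reference: retrain from scratch on all documents (non-adaptive PLSI).
# Caveat: topic order may differ between runs, so topics should be
# aligned before this MSE is interpreted.
_, _, p_w_z_full = plsi_em(train + [new_doc], n_topics=2)
print("MSE of fixed P(w|z) vs. retraining:", mse(p_w_z, p_w_z_full))
```

Because folding-in never changes P(w|z), any shift in term statistics caused by the new document is ignored. That is why it is cheap and also why it can drift from the retrained model, which is the gap the abstract measures with the mean squared error.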

Keywords

Mean Square Error; Latent Dirichlet Allocation; Data Generation Process; Vector Space Model; Latent Topic
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Constantine Kotropoulos (1)
  • Athanasios Papaioannou (1)
  1. Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece