Efficient Probabilistic Latent Semantic Analysis through Parallelization

  • Raymond Wan
  • Vo Ngoc Anh
  • Hiroshi Mamitsuka
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5839)

Abstract

Probabilistic latent semantic analysis (PLSA) is considered an effective technique for information retrieval, but it has one notable drawback: its heavy consumption of computing resources, in terms of both execution time and internal memory. This drawback limits the practical application of the technique to document collections of modest size.

In this paper, we examine the practice of implementing PLSA with the aim of improving its efficiency without changing its output. Recently, Hong et al. [2008] showed how the execution time of PLSA can be improved by employing OpenMP for shared memory parallelization. We extend their work by also studying the effects of combining OpenMP with the Message Passing Interface (MPI) for distributed memory parallelization. By applying our method to several text collections commonly used in the literature, we show how a more careful implementation of PLSA reduces execution time and memory costs.
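To make the cost discussed above concrete, here is a minimal single-process sketch of the standard EM procedure for PLSA in the asymmetric formulation P(w|d) = Σ_z P(w|z) P(z|d). It is an illustration only, not the authors' implementation: the function name `plsa_em`, its parameters, and the sparse dictionary input are all hypothetical. The sketch accumulates the M-step sums while computing the E-step posteriors, so P(z|d,w) is never stored for all pairs at once; this is one common way to limit memory, assumed here rather than taken from the paper.

```python
import math
import random


def plsa_em(counts, n_docs, n_words, n_topics, n_iters=20, seed=0):
    """Run EM for PLSA on sparse counts: {(doc, word): count}.

    Hypothetical helper for illustration; returns P(w|z), P(z|d),
    and the log-likelihood measured at the start of each iteration.
    """
    rng = random.Random(seed)

    def random_dist(n):
        # Strictly positive random probability vector of length n.
        raw = [rng.random() + 1e-3 for _ in range(n)]
        total = sum(raw)
        return [x / total for x in raw]

    p_w_z = [random_dist(n_words) for _ in range(n_topics)]  # P(w|z)
    p_z_d = [random_dist(n_topics) for _ in range(n_docs)]   # P(z|d)
    log_likelihoods = []

    for _ in range(n_iters):
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        ll = 0.0
        # Work here is proportional to (nonzero counts) x (topics).
        # Each (doc, word) pair is independent given the current
        # parameters, so this loop is the natural candidate for
        # splitting across threads or processes.
        for (d, w), n in counts.items():
            joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
            norm = sum(joint)
            if norm <= 0.0:
                continue
            ll += n * math.log(norm)
            for z in range(n_topics):
                r = n * joint[z] / norm  # n(d,w) * P(z|d,w)
                acc_w_z[z][w] += r
                acc_z_d[d][z] += r
        # M-step: renormalize the accumulated expected counts.
        for z in range(n_topics):
            total = sum(acc_w_z[z])
            if total > 0.0:
                p_w_z[z] = [x / total for x in acc_w_z[z]]
        for d in range(n_docs):
            total = sum(acc_z_d[d])
            if total > 0.0:
                p_z_d[d] = [x / total for x in acc_z_d[d]]
        log_likelihoods.append(ll)

    return p_w_z, p_z_d, log_likelihoods
```

In a parallel variant, each worker could process a slice of the (doc, word) pairs with private accumulators that are summed before the normalization step. Whether that matches the division of work used with OpenMP and MPI in the paper is not stated in the abstract.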

References

  1. Chang, J.-M., Su, E.C.-Y., Lo, A., Chiu, H.-S., Sung, T.-Y., Hsu, W.-L.: PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins 72(2), 693–710 (2008)
  2. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
  3. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39(1), 1–38 (1977)
  4. Gabriel, E., Fagg, G.E., Bosilca, G., Angskun, T., Dongarra, J.J., Squyres, J.M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R.H., Daniel, D.J., Graham, R.L., Woodall, T.S.: Open MPI: Goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004), http://www.open-mpi.org/
  5. Hanselmann, M., Kirchner, M., Renard, B.Y., Amstalden, E.R., Glunde, K., Heeren, R.M.A., Hamprecht, F.A.: Concise representation of mass spectrometry images by probabilistic latent semantic analysis. Analytical Chemistry (November 2008), ISSN 1520-6882
  6. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42(1–2), 177–196 (2001)
  7. Hofmann, T.: Probabilistic latent semantic indexing. In: Proc. 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 50–57. ACM Press, New York (1999)
  8. Hong, C., Chen, W., Zheng, W., Shan, J., Chen, Y., Zhang, Y.: Parallelization and characterization of probabilistic latent semantic analysis. In: Proc. 37th International Conference on Parallel Processing, pp. 628–635 (2008)
  9. Kim, Y.-S., Chang, J.-H., Zhang, B.-T.: An empirical study on dimensionality optimization in text mining for linguistic knowledge acquisition. In: Whang, K.-Y., Jeon, J., Shim, K., Srivastava, J. (eds.) PAKDD 2003. LNCS (LNAI), vol. 2637, pp. 111–116. Springer, Heidelberg (2003)
  10. Mamitsuka, H.: Hierarchical latent knowledge analysis for co-occurrence data. In: Proc. 20th International Conference on Machine Learning, pp. 504–511 (2003)
  11. Message Passing Interface Forum. MPI: A message-passing interface standard, version 2.1 (June 2008), http://www.mpi-forum.org/docs/docs.html
  12. OpenMP Architecture Review Board. OpenMP application programming interface, version 3.0 (May 2008), http://openmp.org/wp/openmp-specifications/
  13. Owens, J.D., Houston, M., Luebke, D., Stone, J.E., Philips, J.C.: GPU computing. Proc. IEEE 96(5), 879–899 (2008), http://gpgpu.org/
  14. Park, L.A.F., Ramamohanarao, K.: Efficient storage and retrieval of probabilistic latent semantic information for information retrieval. The VLDB Journal 18(1), 141–155 (2009)
  15. Quinn, M.J.: Parallel Programming in C with MPI and OpenMP. McGraw-Hill, New York (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Raymond Wan (1, 3)
  • Vo Ngoc Anh (2)
  • Hiroshi Mamitsuka (1)
  1. Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Japan
  2. Department of Computer Science and Software Engineering, University of Melbourne, Victoria, Australia
  3. Computational Biology Research Center, AIST, Tokyo, Japan
