Abstract
Probabilistic latent semantic analysis (PLSA) is considered an effective technique for information retrieval, but it has one notable drawback: its heavy consumption of computing resources, in terms of both execution time and internal memory. This drawback restricts the practical application of the technique to document collections of modest size.
In this paper, we look into the practice of implementing PLSA with the aim of improving its efficiency without changing its output. Recently, Hong et al. [2008] have shown how the execution time of PLSA can be improved by employing OpenMP for shared memory parallelization. We extend their work by also studying the effects of using it in combination with the Message Passing Interface (MPI) for distributed memory parallelization. By applying our method to several text collections commonly used in the literature, we show how a more careful implementation of PLSA reduces execution time and memory costs.
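To make the idea of parallelizing PLSA without changing its output concrete, the following is a minimal sketch of one common data-parallel decomposition of the EM iteration: documents are split into blocks, each block is processed independently, and the per-block sums for P(w|z) are added together before normalization. This is an illustration under our own assumptions and not necessarily the scheme used in the paper; the function names `partial_em` and `reduce_and_normalize`, the toy corpus, and the starting probabilities are all hypothetical. In an MPI setting the element-wise sum would typically be done with a collective reduction such as `MPI_Allreduce`, and the loop over documents inside a block is the kind of loop that OpenMP can share among threads.

```python
def partial_em(counts, p_w_z, p_z_d):
    """One EM pass over a block of documents (hypothetical helper).

    counts: list of {word_id: count} dicts, one per document in the block.
    p_w_z:  Z x W nested list holding P(w|z), shared by every block.
    p_z_d:  list of length-Z lists holding P(z|d) for this block's documents.
    Returns (unnormalized Z x W accumulator for P(w|z), updated P(z|d) rows).
    """
    num_topics = len(p_w_z)
    num_words = len(p_w_z[0])
    acc = [[0.0] * num_words for _ in range(num_topics)]
    new_p_z_d = []
    for doc, topics in zip(counts, p_z_d):
        doc_acc = [0.0] * num_topics
        for w, n in doc.items():
            # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
            joint = [topics[z] * p_w_z[z][w] for z in range(num_topics)]
            total = sum(joint)
            if total == 0.0:
                continue
            for z in range(num_topics):
                r = n * joint[z] / total
                acc[z][w] += r       # contributes to the P(w|z) numerator
                doc_acc[z] += r      # contributes to the P(z|d) numerator
        doc_total = sum(doc_acc)
        # M-step for P(z|d) needs only this document, so it stays local.
        new_p_z_d.append(
            [v / doc_total for v in doc_acc] if doc_total else list(topics)
        )
    return acc, new_p_z_d


def reduce_and_normalize(partials):
    """Sum per-block accumulators element-wise, then normalize to P(w|z)."""
    num_topics = len(partials[0])
    num_words = len(partials[0][0])
    result = []
    for z in range(num_topics):
        row = [sum(p[z][w] for p in partials) for w in range(num_words)]
        s = sum(row)
        result.append([v / s for v in row] if s else [1.0 / num_words] * num_words)
    return result


# Toy corpus (assumed data): 4 documents, 3 words, 2 topics.
docs = [{0: 2, 1: 1}, {1: 3, 2: 1}, {0: 1, 2: 2}, {2: 4}]
p_w_z = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
p_z_d = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7], [0.4, 0.6]]

# All documents in one block (the serial case).
acc_all, z_all = partial_em(docs, p_w_z, p_z_d)
serial = reduce_and_normalize([acc_all])

# The same documents split into two blocks, as two workers would see them.
acc_a, z_a = partial_em(docs[:2], p_w_z, p_z_d[:2])
acc_b, z_b = partial_em(docs[2:], p_w_z, p_z_d[2:])
split = reduce_and_normalize([acc_a, acc_b])
```

In this sketch the per-document rows of P(z|d) are identical in both runs, while the P(w|z) estimates agree up to floating-point rounding, because the partial sums are added in a different order.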
References
Chang, J.-M., Su, E.C.-Y., Lo, A., Chiu, H.-S., Sung, T.-Y., Hsu, W.-L.: PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins 72(2), 693–710 (2008)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39(1), 1–38 (1977)
Gabriel, E., Fagg, G.E., Bosilca, G., Angskun, T., Dongarra, J.J., Squyres, J.M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R.H., Daniel, D.J., Graham, R.L., Woodall, T.S.: Open MPI: Goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004), http://www.open-mpi.org/
Hanselmann, M., Kirchner, M., Renard, B.Y., Amstalden, E.R., Glunde, K., Heeren, R.M.A., Hamprecht, F.A.: Concise representation of mass spectrometry images by probabilistic latent semantic analysis. Analytical Chemistry (November 2008), ISSN 1520-6882
Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42(1–2), 177–196 (2001)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proc. 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 50–57. ACM Press, New York (1999)
Hong, C., Chen, W., Zheng, W., Shan, J., Chen, Y., Zhang, Y.: Parallelization and characterization of probabilistic latent semantic analysis. In: Proc. 37th International Conference on Parallel Processing, pp. 628–635 (2008)
Kim, Y.-S., Chang, J.-H., Zhang, B.-T.: An empirical study on dimensionality optimization in text mining for linguistic knowledge acquisition. In: Whang, K.-Y., Jeon, J., Shim, K., Srivastava, J. (eds.) PAKDD 2003. LNCS (LNAI), vol. 2637, pp. 111–116. Springer, Heidelberg (2003)
Mamitsuka, H.: Hierarchical latent knowledge analysis for co-occurrence data. In: Proc. 20th International Conference on Machine Learning, pp. 504–511 (2003)
Message Passing Interface Forum. MPI: A message-passing interface standard, version 2.1 (June 2008), http://www.mpi-forum.org/docs/docs.html
OpenMP Architecture Review Board. OpenMP application programming interface, version 3.0 (May 2008), http://openmp.org/wp/openmp-specifications/
Owens, J.D., Houston, M., Luebke, D., Stone, J.E., Phillips, J.C.: GPU computing. Proc. IEEE 96(5), 879–899 (2008), http://gpgpu.org/
Park, L.A.F., Ramamohanarao, K.: Efficient storage and retrieval of probabilistic latent semantic information for information retrieval. The VLDB Journal 18(1), 141–155 (2009)
Quinn, M.J.: Parallel Programming in C with MPI and OpenMP. McGraw-Hill, New York (2003)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wan, R., Anh, V.N., Mamitsuka, H. (2009). Efficient Probabilistic Latent Semantic Analysis through Parallelization. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_38
DOI: https://doi.org/10.1007/978-3-642-04769-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04768-8
Online ISBN: 978-3-642-04769-5
eBook Packages: Computer Science, Computer Science (R0)