P2LSA and P2LSA+: Two Paralleled Probabilistic Latent Semantic Analysis Algorithms Based on the MapReduce Model

Jin, Yan; Gao, Yang; Shi, Yinghuan; Shang, Lin; Wang, Ruili; Yang, Yubin

doi:10.1007/978-3-642-23878-9_46

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6936)

Abstract

Two novel parallel Probabilistic Latent Semantic Analysis (PLSA) algorithms based on the MapReduce model, P²LSA and P²LSA+, are proposed. On large-scale data sets, both algorithms speed up computation by running on the Hadoop platform. The traditional PLSA method typically uses the Expectation-Maximization (EM) algorithm to estimate two hidden parameter vectors, so parallelizing PLSA amounts to implementing the EM algorithm in parallel. The EM algorithm consists of two steps: the E-step and the M-step. In P²LSA, the Map function performs the E-step and the Reduce function performs the M-step. However, all intermediate results computed in the E-step must then be sent to the M-step, and transferring this large amount of data increases both the network load and the overall running time. In P²LSA+, by contrast, the Map function performs the E-step and the M-step together, which reduces the data transferred between the two steps and improves performance. Experiments are conducted to evaluate both algorithms on a data set of 20000 users and 10927 goods. The speedup curves show that the overall running time decreases as the number of computing nodes increases, and the measured running times show that P²LSA+ is about 3 times faster than P²LSA.
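
The difference between the two designs can be illustrated with a minimal single-process sketch in Python. It is not the authors' Hadoop implementation: it assumes the asymmetric PLSA parameterization with P(w|z) and P(z|d), reads "E-step and M-step in the Map function" as pre-summing the M-step statistics inside each input split, and uses hypothetical helper names (map_per_record, map_per_split, reduce_and_normalize, em_iteration) and toy data invented for illustration.

```python
from collections import defaultdict
import random


def posterior(d, w, p_w_z, p_z_d):
    """E-step for one (doc, word) pair: P(z | d, w) for every topic z."""
    joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z))]
    total = sum(joint)
    return [j / total for j in joint]


def map_per_record(record, p_w_z, p_z_d):
    """Map as E-step only: emit one pair per (record, topic, statistic)."""
    d, w, n = record
    for z, p in enumerate(posterior(d, w, p_w_z, p_z_d)):
        yield ("wz", w, z), n * p  # contributes to P(w | z)
        yield ("zd", d, z), n * p  # contributes to P(z | d)


def map_per_split(split, p_w_z, p_z_d):
    """Map as E-step plus partial M-step: pre-sum within one input split."""
    partial = defaultdict(float)
    for record in split:
        for key, value in map_per_record(record, p_w_z, p_z_d):
            partial[key] += value
    yield from partial.items()


def reduce_and_normalize(pairs, n_topics, n_words, n_docs):
    """Reduce: sum values per key, then normalize rows into probabilities."""
    p_w_z = [[0.0] * n_words for _ in range(n_topics)]
    p_z_d = [[0.0] * n_topics for _ in range(n_docs)]
    for (kind, index, z), value in pairs:
        if kind == "wz":
            p_w_z[z][index] += value
        else:
            p_z_d[index][z] += value
    for table in (p_w_z, p_z_d):
        for row in table:
            total = sum(row) or 1.0  # guard against an all-zero row
            row[:] = [v / total for v in row]
    return p_w_z, p_z_d


def em_iteration(splits, p_w_z, p_z_d, combine_in_map):
    """Run one EM iteration; return new parameters and the emitted pair count."""
    if combine_in_map:
        pairs = [kv for s in splits for kv in map_per_split(s, p_w_z, p_z_d)]
    else:
        pairs = [kv for s in splits for r in s
                 for kv in map_per_record(r, p_w_z, p_z_d)]
    n_topics, n_words, n_docs = len(p_w_z), len(p_w_z[0]), len(p_z_d)
    new_params = reduce_and_normalize(pairs, n_topics, n_words, n_docs)
    return new_params, len(pairs)


def random_rows(n_rows, n_cols, rng):
    """Random positive rows, each normalized to sum to 1."""
    rows = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_rows)]
    return [[v / sum(row) for v in row] for row in rows]


# Toy data (made up): (doc_id, word_id, count) records in two input splits.
splits = [[(0, 0, 2), (0, 1, 1)], [(1, 1, 3), (1, 2, 1)]]
rng = random.Random(0)
init_w_z = random_rows(2, 3, rng)  # 2 topics x 3 words
init_z_d = random_rows(2, 2, rng)  # 2 documents x 2 topics

params_a, shuffled_a = em_iteration(splits, init_w_z, init_z_d, False)
params_b, shuffled_b = em_iteration(splits, init_w_z, init_z_d, True)
print("pairs sent to reduce:", shuffled_a, "vs", shuffled_b)
```

Both variants produce the same updated parameters up to floating-point rounding; only the number of key-value pairs handed to the reduce phase differs. On real data the saving depends on how often a word or document repeats within a split, so this sketch says nothing about the speedup reported in the abstract.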

Author information

Authors and Affiliations

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China
Yan Jin, Yang Gao, Yinghuan Shi, Lin Shang & Yubin Yang
School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand
Ruili Wang

Editor information

Editors and Affiliations

School of Electrical and Electronic Engineering, University of Manchester, Sackville Street Building, M60 1QD, Manchester, UK
Hujun Yin
School of Computing Sciences, University of East Anglia, NR4 7TJ, Norwich, UK
Wenjia Wang
University of East Anglia, NR4 7TJ, Norwich, UK
Victor Rayward-Smith

About this paper

Cite this paper

Jin, Y., Gao, Y., Shi, Y., Shang, L., Wang, R., Yang, Y. (2011). P²LSA and P²LSA+: Two Paralleled Probabilistic Latent Semantic Analysis Algorithms Based on the MapReduce Model. In: Yin, H., Wang, W., Rayward-Smith, V. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2011. IDEAL 2011. Lecture Notes in Computer Science, vol 6936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23878-9_46

DOI: https://doi.org/10.1007/978-3-642-23878-9_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23877-2
Online ISBN: 978-3-642-23878-9
eBook Packages: Computer Science, Computer Science (R0)
