P2LSA and P2LSA+: Two Paralleled Probabilistic Latent Semantic Analysis Algorithms Based on the MapReduce Model

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 6936))

Abstract

Two novel parallel Probabilistic Latent Semantic Analysis (PLSA) algorithms based on the MapReduce model, P2LSA and P2LSA+, are proposed. On large-scale data sets, both algorithms improve computing speed by running on the Hadoop platform. The traditional PLSA method typically uses the Expectation-Maximization (EM) algorithm to estimate two hidden parameter vectors, and the parallel PLSA algorithms implement EM in parallel. The EM algorithm consists of two steps, the E-step and the M-step. In P2LSA, the Map function performs the E-step and the Reduce function performs the M-step. However, all intermediate results computed in the E-step must then be sent to the M-step, and transferring this large amount of data increases both the network load and the overall running time. In contrast, the Map function in P2LSA+ performs the E-step and the M-step together, which reduces the data transferred between the two steps and improves performance. Experiments are conducted to evaluate P2LSA and P2LSA+ on a data set of 20,000 users and 10,927 goods. The speedup curves show that the overall running time decreases as the number of computing nodes increases, and the measured running times show that P2LSA+ is about 3 times faster than P2LSA.
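The difference between the two designs can be illustrated with a minimal single-process sketch. It is not the authors' implementation: it assumes the common PLSA parameterization with P(z | u) and P(i | z), simulates Map and Reduce with plain Python functions, and uses hypothetical names (`map_e_only`, `map_e_and_m`, `reduce_m_step`) and toy data. The first mapper emits every E-step posterior, while the second accumulates the M-step sums locally and emits only those sums.

```python
import random
from collections import defaultdict


def posterior(p_z_u, p_i_z, u, i):
    """E-step for one (user, item) pair: P(z | u, i) proportional to P(z | u) * P(i | z)."""
    weights = [p_z_u[u][z] * p_i_z[z][i] for z in range(len(p_i_z))]
    total = sum(weights)
    return [w / total for w in weights]


def map_e_only(shard, p_z_u, p_i_z):
    """E-step only: emit one weighted posterior per (user, item, topic)."""
    for u, i, count in shard:
        for z, q in enumerate(posterior(p_z_u, p_i_z, u, i)):
            yield ("pair", u, i, z), count * q


def map_e_and_m(shard, p_z_u, p_i_z):
    """E-step plus local M-step sums: emit one record per distinct key in the shard."""
    sums = defaultdict(float)
    for u, i, count in shard:
        for z, q in enumerate(posterior(p_z_u, p_i_z, u, i)):
            sums[("item", z, i)] += count * q
            sums[("user", u, z)] += count * q
    yield from sums.items()


def reduce_m_step(records, n_users, n_items, k):
    """Sum all emitted statistics and normalise them into new parameters."""
    item_sum = [[0.0] * n_items for _ in range(k)]
    user_sum = [[0.0] * k for _ in range(n_users)]
    for key, value in records:
        if key[0] == "pair":  # raw E-step output: split it on the reduce side
            _, u, i, z = key
            item_sum[z][i] += value
            user_sum[u][z] += value
        elif key[0] == "item":
            item_sum[key[1]][key[2]] += value
        else:
            user_sum[key[1]][key[2]] += value
    p_i_z = [[v / (sum(row) or 1.0) for v in row] for row in item_sum]
    p_z_u = [[v / (sum(row) or 1.0) for v in row] for row in user_sum]
    return p_z_u, p_i_z


def run_iteration(shards, p_z_u, p_i_z, mapper):
    """One EM iteration; also returns how many records crossed from map to reduce."""
    records = [r for shard in shards for r in mapper(shard, p_z_u, p_i_z)]
    new_p_z_u, new_p_i_z = reduce_m_step(
        records, len(p_z_u), len(p_i_z[0]), len(p_i_z))
    return new_p_z_u, new_p_i_z, len(records)


def random_dist(rng, n):
    """A strictly positive random probability vector of length n."""
    raw = [rng.random() + 0.1 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


# Toy data (an assumption for illustration): 8 users, 6 items, 2 latent topics.
rng = random.Random(0)
N_USERS, N_ITEMS, K = 8, 6, 2
data = [(u, i, rng.randint(1, 5))
        for u in range(N_USERS) for i in range(N_ITEMS) if (u + i) % 3 != 0]
shards = [[r for r in data if r[0] < 4], [r for r in data if r[0] >= 4]]
p_z_u = [random_dist(rng, K) for _ in range(N_USERS)]
p_i_z = [random_dist(rng, N_ITEMS) for _ in range(K)]

basic = run_iteration(shards, p_z_u, p_i_z, map_e_only)
plus = run_iteration(shards, p_z_u, p_i_z, map_e_and_m)
print("records sent to reduce:", basic[2], "vs", plus[2])
```

Both mappers lead to the same updated parameters up to floating-point rounding; only the number of records passed to the reduce stage differs. The sketch says nothing about the running times reported above, which depend on the cluster and the data.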

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jin, Y., Gao, Y., Shi, Y., Shang, L., Wang, R., Yang, Y. (2011). P2LSA and P2LSA+: Two Paralleled Probabilistic Latent Semantic Analysis Algorithms Based on the MapReduce Model. In: Yin, H., Wang, W., Rayward-Smith, V. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2011. IDEAL 2011. Lecture Notes in Computer Science, vol 6936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23878-9_46

  • DOI: https://doi.org/10.1007/978-3-642-23878-9_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23877-2

  • Online ISBN: 978-3-642-23878-9

  • eBook Packages: Computer Science, Computer Science (R0)
