Parallel Randomized Block Coordinate Descent for Neural Probabilistic Language Model with High-Dimensional Output Targets

  • Conference paper
  • In: Pattern Recognition (CCPR 2016)
  • Part of the book series: Communications in Computer and Information Science (CCIS, volume 663)


Abstract

Training a large neural probabilistic language model with its typically high-dimensional output layer is excessively time-consuming, which is one of the main reasons that simpler models such as n-grams remain more popular despite their inferior performance. In this paper, a Chinese neural probabilistic language model is trained on the Fudan Chinese Language Corpus. Since hundreds of thousands of distinct words are tokenized from the raw corpus, the model contains tens of millions of parameters. To address this challenge, the cluster-based parallel computing platform MPI (Message Passing Interface) is employed to implement the parallel neural network language model. Specifically, we propose a new method, termed Parallel Randomized Block Coordinate Descent (PRBCD), to train this model cost-effectively. Unlike the traditional coordinate descent method, our method can be applied to networks with multiple layers, scaling up the gradients with respect to hidden units proportionally based on the sampled parameters. We empirically show that PRBCD is stable and well suited to language models, which contain only a few layers yet often have a huge number of parameters and extremely high-dimensional output targets.
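To make the sampling-and-rescaling idea concrete, here is a minimal sketch of one randomized block coordinate update on a softmax output layer. This is our own illustration of the abstract's description, not the authors' implementation; the vocabulary size, hidden size, learning rate, block fraction, and the choice to always keep the target word's row in the block are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 50_000, 256                           # vocabulary (output) size, hidden size
W_out = rng.normal(0.0, 0.01, size=(V, H))   # output-layer weights
b_out = np.zeros(V)                          # output-layer biases

def prbcd_step(h, target, block_frac=0.01, lr=0.1):
    """One sketched randomized-block update for a single training example."""
    # Full softmax shown for clarity; a real implementation would restrict
    # the forward pass to the sampled block as well.
    logits = W_out @ h + b_out
    p = np.exp(logits - logits.max())
    p /= p.sum()
    err = p.copy()
    err[target] -= 1.0                       # dL/dlogits for cross-entropy loss

    # Sample a random block of output rows (the target row is always kept).
    k = max(1, int(block_frac * V))
    block = np.union1d(rng.choice(V, size=k, replace=False), [target])

    # Gradient w.r.t. the hidden units uses only the sampled rows, scaled up
    # proportionally (V / |block|) to compensate for the sampling.
    grad_h = (V / block.size) * (W_out[block].T @ err[block])

    # Update only the sampled rows of the output layer.
    W_out[block] -= lr * np.outer(err[block], h)
    b_out[block] -= lr * err[block]

    return grad_h                            # back-propagated into the lower layers

# Example: one update for a random hidden vector and target word id 42.
print(prbcd_step(rng.normal(size=H), target=42).shape)   # (256,)
```

The proportional rescaling keeps the hidden-unit gradient an (approximately) unbiased estimate of the full-softmax gradient while touching only a small fraction of the output parameters per step.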


Notes

  1. Here ‘identical’ means each processor keeps exactly the same parameters and training samples in the embedding layer and hidden layer. When the proposed PRBCD (discussed later in the paper) is applied, they are no longer strictly identical.

  2. http://www.mpich.org/static/docs/latest/www3/MPI_Allreduce.html (a minimal mpi4py sketch of this gradient-averaging step follows these notes).

  3. See http://www.mpi-forum.org.

  4. See https://github.com/fxsjy/jieba/.

  5. In parallel computing, an embarrassingly parallel workload or problem is one where little or no effort is needed to separate it into multiple parallel tasks [7].

  6. http://www.datatang.com/data/44139, http://www.datatang.com/data/43543.

  7. Overall categories include Agriculture, Art, Communication, Computer, Economy, Education, Electronics, Energy, Environment, History, Law, Literature, Medical, Military, Mine, Philosophy, Politics, Space, Sports, Transport.

  8. Toolbox: https://code.google.com/p/word2vec/.
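Notes 1 and 2 describe the data-parallel setup: every MPI process keeps a replica of the shared parameters, computes a gradient on its own shard of training samples, and the gradients are combined with MPI_Allreduce so all replicas apply the same update. The following is a hedged sketch of that single step using mpi4py; the parameter size, learning rate, and the use of mpi4py (rather than the paper's own MPI code) are our assumptions.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Replicated parameters (note 1): every rank starts from the same values.
params = np.zeros(1_000_000, dtype=np.float64)

# Stand-in for the gradient this rank computed on its own shard of samples.
local_grad = np.random.default_rng(rank).normal(size=params.shape)

# MPI_Allreduce (note 2): sum gradients across all ranks, then average,
# so every replica applies an identical update and stays synchronized.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

params -= 0.01 * global_grad
if rank == 0:
    print("norm of averaged gradient:", np.linalg.norm(global_grad))
```

Saved as, say, allreduce_step.py, this would be launched with something like mpiexec -n 4 python allreduce_step.py.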

References

  1. Bengio, Y., Schwenk, H., Senécal, J.S., Morin, F., Gauvain, J.L.: Neural probabilistic language models. In: Holmes, D.E., Jain, L.C. (eds.) Innovations in Machine Learning, pp. 137–186. Springer, Heidelberg (2006)

  2. Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)

  3. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  4. Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)

  5. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)

  6. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV (2015)

  7. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming, Revised Reprint (2012)

  8. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)

  9. Jelinek, F.: Interpolated estimation of Markov source parameters from sparse data. In: Pattern Recognition in Practice (1980)

  10. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  11. Lai, S., Liu, K., Xu, L., Zhao, J.: How to generate a good word embedding? arXiv preprint arXiv:1507.05523 (2015)

  12. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, vol. 30, p. 1 (2013)

  13. Mikolov, T., Deoras, A., Kombrink, S., Burget, L., Cernocky, J.: Empirical evaluation and combination of advanced language modeling techniques. In: Proceedings of Interspeech, pp. 605–608 (2011)

  14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  15. Mnih, A., Hinton, G.: A scalable hierarchical distributed language model (2009)

  16. Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models (2012)

  17. Morin, F., Bengio, Y.: Hierarchical probabilistic neural network language model, pp. 246–252 (2005)

  18. Sainath, T., Kingsbury, B., Sindhwani, V., Arisoy, E., Ramabhadran, B.: Low-rank matrix factorization for deep neural network training with high-dimensional output targets, pp. 6655–6659 (2013)

  19. Schwenk, H., Gauvain, J.L.: Training neural network language models on very large corpora, pp. 201–208 (2005)

  20. Tieleman, T., Hinton, G.: Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 4, 2 (2012)

  21. Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theor. 13(2), 260–269 (1967)

  22. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)

  23. Zhao, T., Yu, M., Wang, Y., Arora, R., Liu, H.: Accelerated mini-batch randomized block coordinate descent method. In: Advances in Neural Information Processing Systems, pp. 3329–3337 (2014)


Acknowledgement

This work is partially supported by China Postdoctoral Science Foundation Funded Project (2016M590337), NSFC (11501210), Shanghai YangFan Plan (15YF1403400), and Shanghai Science and Technology Committee Project (15JC1401700).

Author information


Corresponding author

Correspondence to Junchi Yan.


Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Liu, X., Yan, J., Wang, X., Zha, H. (2016). Parallel Randomized Block Coordinate Descent for Neural Probabilistic Language Model with High-Dimensional Output Targets. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_28


  • DOI: https://doi.org/10.1007/978-981-10-3005-5_28

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3004-8

  • Online ISBN: 978-981-10-3005-5

  • eBook Packages: Computer Science, Computer Science (R0)
