Parallel Randomized Block Coordinate Descent for Neural Probabilistic Language Model with High-Dimensional Output Targets

  • Conference paper
  • In: Pattern Recognition (CCPR 2016)
  • Part of the book series: Communications in Computer and Information Science (CCIS, volume 663)


Abstract

Training a large neural probabilistic language model with its typically high-dimensional output layer is excessively time-consuming, which is one of the main reasons that simpler models such as n-grams remain more popular despite their inferior performance. In this paper, a Chinese neural probabilistic language model is trained on the Fudan Chinese Language Corpus. Since hundreds of thousands of distinct words are tokenized from the raw corpus, the model contains tens of millions of parameters. To address this challenge, the cluster-based parallel computing platform MPI (Message Passing Interface) is employed to implement the parallel neural network language model. Specifically, we propose a new method, termed Parallel Randomized Block Coordinate Descent (PRBCD), to train this model cost-effectively. Unlike the traditional coordinate descent method, our method can be applied to networks with multiple layers, scaling up the gradients with respect to hidden units proportionally based on the sampled parameters. We empirically show that PRBCD is stable and well suited to language models, which contain only a few layers yet often have a huge number of parameters and extremely high-dimensional output targets.
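To make the sampling-and-rescaling idea concrete, here is a minimal sketch of one randomized block coordinate update on a softmax output layer. This is our own illustration of the abstract's description, not the authors' implementation; the vocabulary size, hidden size, learning rate, block fraction, and the choice to always keep the target word's row in the block are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 50_000, 256                           # vocabulary (output) size, hidden size
W_out = rng.normal(0.0, 0.01, size=(V, H))   # output-layer weights
b_out = np.zeros(V)                          # output-layer biases

def prbcd_step(h, target, block_frac=0.01, lr=0.1):
    """One sketched randomized-block update for a single training example."""
    # Full softmax shown for clarity; a real implementation would restrict
    # the forward pass to the sampled block as well.
    logits = W_out @ h + b_out
    p = np.exp(logits - logits.max())
    p /= p.sum()
    err = p.copy()
    err[target] -= 1.0                       # dL/dlogits for cross-entropy loss

    # Sample a random block of output rows (the target row is always kept).
    k = max(1, int(block_frac * V))
    block = np.union1d(rng.choice(V, size=k, replace=False), [target])

    # Gradient w.r.t. the hidden units uses only the sampled rows, scaled up
    # proportionally (V / |block|) to compensate for the sampling.
    grad_h = (V / block.size) * (W_out[block].T @ err[block])

    # Update only the sampled rows of the output layer.
    W_out[block] -= lr * np.outer(err[block], h)
    b_out[block] -= lr * err[block]

    return grad_h                            # back-propagated into the lower layers

# Example: one update for a random hidden vector and target word id 42.
print(prbcd_step(rng.normal(size=H), target=42).shape)   # (256,)
```

The proportional rescaling keeps the hidden-unit gradient an (approximately) unbiased estimate of the full-softmax gradient while touching only a small fraction of the output parameters per step.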


Notes

  1. Here ‘identical’ means each processor keeps exactly the same parameters and training samples in the embedding layer and hidden layer. When the proposed PRBCD (discussed later in the paper) is applied, they are no longer strictly identical.

  2. http://www.mpich.org/static/docs/latest/www3/MPI_Allreduce.html (a minimal mpi4py sketch of this gradient-averaging step follows these notes).

  3. See http://www.mpi-forum.org.

  4. See https://github.com/fxsjy/jieba/.

  5. In parallel computing, an embarrassingly parallel workload or problem is one where little or no effort is needed to separate it into multiple parallel tasks [7].

  6. http://www.datatang.com/data/44139, http://www.datatang.com/data/43543.

  7. Overall categories include Agriculture, Art, Communication, Computer, Economy, Education, Electronics, Energy, Environment, History, Law, Literature, Medical, Military, Mine, Philosophy, Politics, Space, Sports, Transport.

  8. Toolbox: https://code.google.com/p/word2vec/.
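Notes 1 and 2 describe the data-parallel setup: every MPI process keeps a replica of the shared parameters, computes a gradient on its own shard of training samples, and the gradients are combined with MPI_Allreduce so all replicas apply the same update. The following is a hedged sketch of that single step using mpi4py; the parameter size, learning rate, and the use of mpi4py (rather than the paper's own MPI code) are our assumptions.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Replicated parameters (note 1): every rank starts from the same values.
params = np.zeros(1_000_000, dtype=np.float64)

# Stand-in for the gradient this rank computed on its own shard of samples.
local_grad = np.random.default_rng(rank).normal(size=params.shape)

# MPI_Allreduce (note 2): sum gradients across all ranks, then average,
# so every replica applies an identical update and stays synchronized.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

params -= 0.01 * global_grad
if rank == 0:
    print("norm of averaged gradient:", np.linalg.norm(global_grad))
```

Saved as, say, allreduce_step.py, this would be launched with something like mpiexec -n 4 python allreduce_step.py.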

References

  1. Bengio, Y., Schwenk, H., Senécal, J.S., Morin, F., Gauvain, J.L.: Neural probabilistic language models. In: Holmes, D.E., Jain, L.C. (eds.) Innovations in Machine Learning, pp. 137–186. Springer, Heidelberg (2006)

  2. Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)

  3. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  4. Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)

  5. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)

  6. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV (2015)

  7. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming, Revised Reprint (2012)

  8. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)

  9. Jelinek, F.: Interpolated estimation of Markov source parameters from sparse data. In: Pattern Recognition in Practice (1980)

  10. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  11. Lai, S., Liu, K., Xu, L., Zhao, J.: How to generate a good word embedding? arXiv preprint arXiv:1507.05523 (2015)

  12. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, vol. 30, p. 1 (2013)

  13. Mikolov, T., Deoras, A., Kombrink, S., Burget, L., Cernocky, J.: Empirical evaluation and combination of advanced language modeling techniques. In: Proceedings of Interspeech, pp. 605–608 (2011)

  14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  15. Mnih, A., Hinton, G.: A scalable hierarchical distributed language model (2009)

  16. Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models (2012)

  17. Morin, F., Bengio, Y.: Hierarchical probabilistic neural network language model, pp. 246–252 (2005)

  18. Sainath, T., Kingsbury, B., Sindhwani, V., Arisoy, E., Ramabhadran, B.: Low-rank matrix factorization for deep neural network training with high-dimensional output targets, pp. 6655–6659 (2013)

  19. Schwenk, H., Gauvain, J.L.: Training neural network language models on very large corpora, pp. 201–208 (2005)

  20. Tieleman, T., Hinton, G.: Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 4, 2 (2012)

  21. Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theor. 13(2), 260–269 (1967)

  22. Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)

  23. Zhao, T., Yu, M., Wang, Y., Arora, R., Liu, H.: Accelerated mini-batch randomized block coordinate descent method. In: Advances in Neural Information Processing Systems, pp. 3329–3337 (2014)


Acknowledgement

This work is partially supported by China Postdoctoral Science Foundation Funded Project (2016M590337), NSFC (11501210), Shanghai YangFan Plan (15YF1403400), and Shanghai Science and Technology Committee Project (15JC1401700).

Author information


Corresponding author

Correspondence to Junchi Yan.


Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Liu, X., Yan, J., Wang, X., Zha, H. (2016). Parallel Randomized Block Coordinate Descent for Neural Probabilistic Language Model with High-Dimensional Output Targets. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_28


  • DOI: https://doi.org/10.1007/978-981-10-3005-5_28

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3004-8

  • Online ISBN: 978-981-10-3005-5

  • eBook Packages: Computer Science, Computer Science (R0)
