An Optimized Regularization Method to Enhance Low-Resource MT

  • Yatu Ji
  • Hongxu Hou (corresponding author)
  • Ying Lei
  • Zhong Ren
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 931)


Overfitting caused by scarce parallel corpora is a serious problem in low-resource machine translation and weakens the generalization ability of translation models. Dropout and Dropconnect address this issue by randomly dropping neurons or weights during training, which improves generalization. In this paper, we optimize Dropconnect for low-resource machine translation by applying a Gaussian approximation to the Bernoulli distribution, and we perform an integration to alleviate the uneven sampling effect in Dropout and Dropconnect, in particular the problem of inadequate training. The approach approximates the mask calculations with linear operations while still allowing the model to be fully trained. An interesting finding is that agglutinative languages are more sensitive to our regularization methods. Our approach outperforms Dropout and Dropconnect on low-resource translation tasks.
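
The abstract does not give the formulas, so the following is only a minimal sketch of one standard way to approximate a Bernoulli weight mask with a Gaussian, not necessarily the authors' formulation. It assumes a single unit whose pre-activation is the masked sum of w_i * x_i with mask entries m_i ~ Bernoulli(keep_prob); by moment matching, that sum has mean keep_prob * sum(w_i * x_i) and variance keep_prob * (1 - keep_prob) * sum((w_i * x_i)^2). All function and parameter names here are hypothetical.

```python
import math
import random


def dropconnect_bernoulli(weights, x, keep_prob, rng):
    """Pre-activation of one unit with an explicit Bernoulli mask per weight."""
    total = 0.0
    for w, xi in zip(weights, x):
        # m_i ~ Bernoulli(keep_prob): keep the weight with probability keep_prob
        mask = 1.0 if rng.random() < keep_prob else 0.0
        total += mask * w * xi
    return total


def gaussian_moments(weights, x, keep_prob):
    """Mean and variance of the masked sum, obtained by moment matching."""
    terms = [w * xi for w, xi in zip(weights, x)]
    mean = keep_prob * sum(terms)
    var = keep_prob * (1.0 - keep_prob) * sum(t * t for t in terms)
    return mean, var


def dropconnect_gaussian(weights, x, keep_prob, rng):
    """Draw the pre-activation from the matched Gaussian; no mask is sampled."""
    mean, var = gaussian_moments(weights, x, keep_prob)
    return rng.gauss(mean, math.sqrt(var))


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(100)]
    x = [rng.uniform(-1.0, 1.0) for _ in range(100)]
    keep_prob = 0.5
    n = 2000
    bern = [dropconnect_bernoulli(weights, x, keep_prob, rng) for _ in range(n)]
    gaus = [dropconnect_gaussian(weights, x, keep_prob, rng) for _ in range(n)]
    print("analytic (mean, var):", gaussian_moments(weights, x, keep_prob))
    print("Bernoulli sample mean:", sum(bern) / n)
    print("Gaussian sample mean:", sum(gaus) / n)
```

In this sketch the Gaussian path replaces one random draw per weight with two linear reductions and a single draw per unit, which is the sense in which a mask calculation can be reduced to linear operations. How the paper combines this with Dropout is not recoverable from the abstract and is left out here.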


Keywords: Low-resource machine translation; Over-fitting; Uneven sampling; Regularization method



We thank the PDCAT-18 reviewers. This work is supported by the Natural Science Foundation of Inner Mongolia (No. 2018MS06005), the Mongolian Language Information Special Support Project of Inner Mongolia (No. MW-2018-MGYWXXH-302), and the Postgraduate Scientific Research Innovation Foundation of Inner Mongolia (No. 10000-16010109-14).



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. College of Computer Science, Inner Mongolia University, Hohhot, China
