An End-to-End Method for Data Filtering on Tibetan-Chinese Parallel Corpus via Negative Sampling

  • Sangjie Duanzhu
  • Cizhen Jiacuo
  • Rou Te
  • Sanzhi Jia
  • Cairang Jia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11856)


In the field of machine translation, a parallel corpus is the most important prerequisite for learning the complex mappings between a targeted language pair. In practice, however, the scale of the parallel corpus is not the only factor that matters for translation performance: the quality of the parallel data itself also has a tremendous impact on model capacity. In recent years, neural machine translation (NMT) systems have become the de facto choice in MT research, but they are more vulnerable than traditional statistical machine translation models to noise present in the training data. Data filtering is therefore an indispensable procedure in the NMT pre-processing pipeline. Instead of building a ranking function over sentence pairs from discrete feature representations of basic language units, in this work we propose a fully end-to-end parallel sentence classifier that estimates the probability that a given sentence pair is an equivalent translation. We evaluated our model in three scenarios: classification, sentence extraction, and NMT data filtering. All experiments showed promising results; in particular, in Tibetan-Chinese NMT experiments we observed a 3.7 BLEU improvement after applying our data filtering method, indicating the effectiveness of our model.
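The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and a trivial placeholder scorer stands in for the neural sentence-pair classifier. Positive training examples are aligned sentence pairs; negatives pair a source sentence with a randomly mismatched target; the trained classifier's score is then used to filter the corpus.

```python
import random

def build_training_pairs(src_sents, tgt_sents, neg_per_pos=1, seed=0):
    """Create labeled examples via negative sampling: each aligned pair is a
    positive (label 1); each source sentence is also paired with
    `neg_per_pos` randomly chosen non-aligned targets as negatives (label 0)."""
    rng = random.Random(seed)
    examples = []
    n = len(src_sents)
    for i, (s, t) in enumerate(zip(src_sents, tgt_sents)):
        examples.append((s, t, 1))
        for _ in range(neg_per_pos):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # skip index i so the negative target is truly mismatched
            examples.append((s, tgt_sents[j], 0))
    return examples

def filter_corpus(pairs, score_fn, threshold=0.5):
    """Keep only the sentence pairs that the (trained) classifier scores
    at or above the threshold; `score_fn(src, tgt)` returns a probability."""
    return [(s, t) for s, t in pairs if score_fn(s, t) >= threshold]
```

In the paper's setting, `score_fn` would be the end-to-end neural classifier trained on the examples from `build_training_pairs`; the filtered corpus is then used to train the NMT system.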


Keywords: Tibetan-Chinese · Data filtering · Neural machine translation



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Sangjie Duanzhu (1)
  • Cizhen Jiacuo (1)
  • Rou Te (1)
  • Sanzhi Jia (1)
  • Cairang Jia (1), corresponding author

  1. Key Laboratory of Tibetan Information Processing and Machine Translation, Qinghai Normal University, Xining, China
