Research on Domain Adaptation for SMT Based on Specific Domain Knowledge

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 668)

Abstract

In statistical machine translation (SMT), training data usually come from diverse sources, cover multiple themes and genres, and often do not match the domain of the text to be translated, which gives rise to the domain adaptation problem. Existing adaptation methods for SMT are driven by the target text and focus on selecting training data and adjusting translation models. These approaches do not assign explicit domain labels to texts or data. This study assigns explicit domain labels, using two kinds of specific domain knowledge as examples: (1) domain knowledge based on the Chinese Thesaurus is applied to assign Chinese Library Classification numbers as domain labels to Chinese texts; (2) two-dimensional lexicalized domain knowledge, namely Semantic Category and Application Scenario, is used to label Japanese sentences. Based on the domain labels obtained for the development and test data, the training data can be filtered to achieve domain consistency. Experiments show that using only part of the training data yields translation performance comparable to using the whole training data, which indicates that the method is efficient and feasible.
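The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `target_domains`, `filter_training_data`, and `toy_label`, the `top_k` parameter, and the tiny keyword lexicon are all hypothetical stand-ins. In the paper the labels come from the Chinese Thesaurus (for Chinese) or from two-dimensional lexicalized domain knowledge (for Japanese); here a toy labeler maps keywords to illustrative class letters.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple

SentencePair = Tuple[str, str]          # (source sentence, target sentence)
Labeler = Callable[[str], Set[str]]     # sentence -> set of domain labels


def target_domains(sentences: Iterable[str], label_fn: Labeler, top_k: int = 1) -> Set[str]:
    """Collect the top_k most frequent domain labels over dev/test sentences."""
    counts = Counter(label for s in sentences for label in label_fn(s))
    return {label for label, _ in counts.most_common(top_k)}


def filter_training_data(pairs: Iterable[SentencePair], label_fn: Labeler,
                         wanted: Set[str]) -> List[SentencePair]:
    """Keep training pairs whose source side shares a label with the target domains."""
    return [pair for pair in pairs if label_fn(pair[0]) & wanted]


# Hypothetical keyword lexicon standing in for real domain knowledge.
# The class letters are illustrative only.
TOY_LEXICON = {"translation": "H", "corpus": "H", "protein": "Q", "cell": "Q"}


def toy_label(sentence: str) -> Set[str]:
    """Assign labels by keyword lookup (assumption: whitespace-tokenized text)."""
    return {TOY_LEXICON[w] for w in sentence.lower().split() if w in TOY_LEXICON}


dev = ["the corpus improves translation", "translation quality"]
train = [("a parallel corpus", "x"), ("the cell divides", "y"), ("protein folding", "z")]

wanted = target_domains(dev, toy_label)
selected = filter_training_data(train, toy_label, wanted)
```

Only the first training pair shares a label with the development sentences, so it is the only one kept. A real system would swap `toy_label` for a thesaurus-based or lexicon-based labeler and would likely need a fallback policy for sentences that receive no label at all.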

Notes

  1. http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (61303152, 71503240) and ISTIC Research Foundation Projects (ZD2016-05).

Author information

Corresponding author

Correspondence to Ying Li.

Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

He, Y., Ding, L., Li, Y. (2016). Research on Domain Adaptation for SMT Based on Specific Domain Knowledge. In: Yang, M., Liu, S. (eds) Machine Translation. CWMT 2016. Communications in Computer and Information Science, vol 668. Springer, Singapore. https://doi.org/10.1007/978-981-10-3635-4_5

  • DOI: https://doi.org/10.1007/978-981-10-3635-4_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3634-7

  • Online ISBN: 978-981-10-3635-4

  • eBook Packages: Computer Science (R0)
