Phrase-Level Grouping for Lexical Gap Resolution in Korean-Vietnamese SMT

  • Seung Woo Cho
  • Eui-Hyeon Lee
  • Jong-Hyeok Lee
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 781)


A lexical gap easily leads to word alignment errors, which impair translation quality. This paper proposes some simple ideas for handling lexical gaps. In morphologically rich languages, a predicate has a complex structure consisting of many morphemes, so we mainly address how to group the component morphemes by employing morpho-syntactic filters and statistical information from the SMT phrase table. In addition, we abstract the grouping results depending on the lexical choice on the target side to enhance translation probabilities. In the experiments, we investigate how each method affects Korean-to-Vietnamese SMT and show a promising improvement in BLEU score.


Statistical machine translation; Lexical gap resolution; Morpheme group; Multi-word expression; Korean-Vietnamese translation



This work was partly supported by the ICT R&D program of MSIP/IITP [R7119-16-1001, Core technology development of the real-time simultaneous speech translation based on knowledge enhancement], the ICT Consilience Creative Program of MSIP/IITP [R0346-16-1007] and SYSTRAN.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, Republic of Korea
