The Encoder-Decoder Framework and Its Applications

  • Ahmad Asadi
  • Reza Safabakhsh
Chapter
Part of the Studies in Computational Intelligence book series (SCI, volume 866)

Abstract

The neural encoder-decoder framework has significantly advanced the state of the art in machine translation. In recent years, many researchers have employed encoder-decoder models to solve sophisticated tasks such as image/video captioning, textual/visual question answering, and text summarization. In this work, we study the baseline encoder-decoder framework in machine translation and take a brief look at the encoder structures proposed to cope with the difficulties of feature extraction. Furthermore, we provide an empirical study of solutions that enable decoders to generate richer, fine-grained output sentences. Finally, we study the attention mechanism, a technique for coping with long-term dependencies and improving encoder-decoder performance on sophisticated tasks.
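
To make the attention idea concrete, a minimal sketch of one decoding step with Bahdanau-style additive attention is given below. It uses only NumPy; the function name, parameter names, dimensions, and random stand-in weights are illustrative assumptions for this sketch, not the implementation studied in the chapter.

import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(encoder_states, decoder_state, W_enc, W_dec, v):
    """Compute a context vector for the current decoder state.

    encoder_states: (T, d_enc)  hidden states produced by the encoder
    decoder_state:  (d_dec,)    previous decoder hidden state
    W_enc, W_dec, v: learned projection parameters (random stand-ins here)
    """
    # Additive scoring: e_t = v^T tanh(W_enc h_t + W_dec s)
    scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v   # (T,)
    weights = softmax(scores)            # attention distribution over source positions
    context = weights @ encoder_states   # weighted sum of encoder states, shape (d_enc,)
    return context, weights

# Toy example: 5 source positions, 8-dim encoder states, 6-dim decoder state.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
s = rng.normal(size=(6,))
W_enc = rng.normal(size=(8, 10))
W_dec = rng.normal(size=(6, 10))
v = rng.normal(size=(10,))
context, weights = attention_step(H, s, W_enc, W_dec, v)
print(np.round(weights, 3))  # sums to 1; the largest weight marks the most attended source position

At each output step the decoder recomputes these weights, so it can focus on different source positions instead of compressing the whole input into a single fixed-length vector; this is what mitigates the long-term dependency problem discussed in the chapter.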

Keywords

Encoder-decoder framework · Machine translation · Image captioning · Video caption generation · Question answering · Long-term dependencies · Attention mechanism

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran
