Vocabulary Learning Support System Based on Automatic Image Captioning Technology

  • Mohammad Nehal Hasnine
  • Brendan Flanagan
  • Gokhan Akcapinar
  • Hiroaki Ogata
  • Kousuke Mouri
  • Noriko Uosaki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11587)


Learning context has proven to be an essential part of vocabulary development; however, describing the learning context for each vocabulary item is considered difficult. For the human brain, it is relatively easy to grasp learning contexts through pictures, because a picture conveys at a glance an amount of detail that text annotations cannot. Therefore, in an informal language learning system, pictures can be used to overcome the problems that language learners face in describing learning contexts. The present study aimed to develop a support system that automatically generates and represents learning contexts by analyzing the visual contents of pictures captured by language learners. Automatic image captioning, an artificial intelligence technology that connects computer vision and natural language processing, is used to analyze the visual contents of the learners’ captured images. A neural image caption generator model called Show and Tell is trained for image-to-word generation and for describing the context of an image. The objectives of this research are threefold: first, to build an intelligent technology that can understand the contents of a picture and generate learning contexts automatically; second, to let a learner learn multiple vocabulary items from one picture without relying on a representative picture for each item; and third, to map a learner’s prior vocabulary knowledge onto new vocabulary so that previously acquired vocabulary can be reviewed and recalled while learning new vocabulary.
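As a rough illustration of the caption-generation step described above, the sketch below implements the greedy decoding loop that models such as Show and Tell use to emit a caption word by word. The vocabulary, the `stub_step` scorer, and the fixed caption are invented for illustration only; the published model conditions an LSTM on Inception-v3 image features and is trained on MS COCO.

```python
# Toy sketch of greedy caption decoding (hypothetical vocabulary and scorer;
# in the real Show and Tell model, `step` is one LSTM step conditioned on
# CNN image features).

from typing import Callable, List

VOCAB = ["<start>", "<end>", "a", "dog", "park", "in", "runs"]

def greedy_decode(step: Callable[[List[int]], List[float]],
                  max_len: int = 10) -> List[str]:
    """Greedily emit words until <end> or max_len is reached.

    `step` maps the token sequence so far to one score per vocabulary word.
    """
    tokens = [VOCAB.index("<start>")]
    for _ in range(max_len):
        scores = step(tokens)
        nxt = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(nxt)
        if VOCAB[nxt] == "<end>":
            break
    return [VOCAB[t] for t in tokens[1:-1]]  # strip <start>/<end>

# A stub "model" that always continues the caption "a dog runs in a park".
CAPTION = ["a", "dog", "runs", "in", "a", "park", "<end>"]

def stub_step(tokens: List[int]) -> List[float]:
    target = CAPTION[len(tokens) - 1]
    return [1.0 if w == target else 0.0 for w in VOCAB]

print(" ".join(greedy_decode(stub_step)))  # a dog runs in a park
```

The words emitted by such a decoder (here "dog", "park", "runs") are what the proposed system would surface to the learner as vocabulary candidates grounded in the picture's context.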


Keywords: Artificial intelligence in education · Automatic image captioning · Learning context representation · Ubiquitous learning · Visual contents analysis · Vocabulary learning



This work was partly supported by JSPS Grant-in-Aid for Scientific Research (S)16H06304 and 17K12947; NEDO Special Innovation Program on AI and Big Data 18102059-0; and JSPS Start-up Grant-in-Aid Number 18H05745.


References

  1. Stockwell, G.: A review of technology choice for teaching language skills and areas in the CALL literature. ReCALL 19(2), 105–120 (2007)
  2. Yeh, Y., Wang, C.: Effects of multimedia vocabulary annotations and learning styles on vocabulary learning. Calico J. 21(1), 131–144 (2003)
  3. Heift, T.: Error-specific and individualised feedback in a Web-based language tutoring system: do they read it? ReCALL 13(1), 99–109 (2001)
  4. Coll, J.F.: Richness of semantic encoding in a hypermedia-assisted instructional environment for ESP: effects on incidental vocabulary retention among learners with low ability in the target language. ReCALL 14(2), 263–284 (2002)
  5. Gu, P.Y.: Vocabulary learning in a second language: person, task, context and strategies. TESL-EJ 7(2), 1–25 (2003)
  6. Ibrahim, W.J.: The importance of contextual situation in language teaching. Adab AL Rafidayn 51, 630–655 (2008)
  7. Sternberg, R.J.: Most vocabulary is learned from context. Nat. Vocab. Acquis. 89, 105 (1987)
  8. Nagy, W.E.: On the role of context in first- and second-language vocabulary learning. University of Illinois at Urbana-Champaign, Center for the Study of Reading, Champaign (1995)
  9. Daniel, R.P.R.: Caption this, with TensorFlow. O’Reilly Media, 28 March 2017
  10. Liu, C., Wang, C., Sun, F., Rui, Y.: Image2Text: a multimodal caption generator. In: ACM Multimedia, pp. 746–748 (2016)
  11. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
  12. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res. 47, 853–899 (2013)
  13. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2649 (2015)
  14. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014)
  15. Image Processing with the Computer Vision API—Microsoft Azure
  16. Ogata, H., Uosaki, N., Mouri, K., Hasnine, M.N., Abou-Khalil, V., Flanagan, B.: SCROLL dataset in the context of ubiquitous language learning. In: Workshop Proceedings of the 26th International Conference on Computers in Education, Manila, Philippines, pp. 418–423 (2018)
  17. Hasnine, M.N., Mouri, K., Flanagan, B., Akcapinar, G., Uosaki, N., Ogata, H.: Image recommendation for informal vocabulary learning in a context-aware learning environment. In: Proceedings of the 26th International Conference on Computers in Education, Manila, Philippines, pp. 669–674 (2018)
  18. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 652–663 (2017)
  19. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
  20. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
  21. Bai, S., An, S.: A survey on automatic image caption generation. Neurocomputing 311, 291–304 (2018)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mohammad Nehal Hasnine (1)
  • Brendan Flanagan (1)
  • Gokhan Akcapinar (4)
  • Hiroaki Ogata (1)
  • Kousuke Mouri (2)
  • Noriko Uosaki (3)

  1. Kyoto University, Kyoto, Japan
  2. Tokyo University of Agriculture and Technology, Tokyo, Japan
  3. Osaka University, Osaka, Japan
  4. Hacettepe University, Ankara, Turkey
