Visual Text Correction

  • Amir Mazaheri
  • Mubarak Shah
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11217)

Abstract

This paper introduces a new problem, called Visual Text Correction (VTC): finding and replacing an inaccurate word in the textual description of a video. We propose a deep network that can simultaneously detect an inaccuracy in a sentence and fix it by replacing the inaccurate word(s). Our method leverages the semantic interdependence of videos and words, as well as the short-term and long-term relations of the words in a sentence. Our proposed formulation solves the VTC problem with an end-to-end network in two steps: (1) inaccuracy detection, and (2) correct word prediction. In the detection step, each word of the sentence is reconstructed from its context, such that the reconstruction error for the inaccurate word is maximized. We exploit both short-term and long-term dependencies, employing Convolutional N-Grams and LSTMs respectively, to reconstruct the word vectors. For the correction step, the basic idea is simply to substitute the word with the maximum reconstruction error with a better one. This second step is essentially a classification problem in which the classes are the words of the dictionary, offered as replacement options. Furthermore, to train and evaluate our model, we propose an approach to automatically construct a large dataset for the VTC problem. Our experiments and performance analysis demonstrate that the proposed method provides very good results and also highlight the general challenges in solving the VTC problem. To the best of our knowledge, this work is the first of its kind for the Visual Text Correction task.
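To make the two-step formulation concrete, below is a minimal sketch in PyTorch of the detection step: each word embedding is reconstructed from sentence context via a convolutional n-gram branch (short-term dependencies) and a bidirectional LSTM branch (long-term dependencies), and the word with the largest reconstruction error is flagged as inaccurate. The module names, dimensions, and fusion scheme are illustrative assumptions, not the paper's exact architecture, and video features are omitted here for brevity.

```python
# Hedged sketch of the inaccuracy-detection step. All hyperparameters
# (emb_dim, hidden, kernel) and the concatenation-based fusion are
# assumptions for illustration only.
import torch
import torch.nn as nn

class InaccuracyDetector(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=256, kernel=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Short-term dependencies: convolutional n-gram over the sentence.
        self.conv = nn.Conv1d(emb_dim, hidden, kernel, padding=kernel // 2)
        # Long-term dependencies: bidirectional LSTM over the sentence.
        self.lstm = nn.LSTM(emb_dim, hidden // 2, batch_first=True,
                            bidirectional=True)
        # Reconstruct each word vector from the fused context features.
        self.decode = nn.Linear(2 * hidden, emb_dim)

    def forward(self, tokens):                            # tokens: (B, T)
        e = self.embed(tokens)                            # (B, T, emb_dim)
        c = self.conv(e.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        h, _ = self.lstm(e)                               # (B, T, hidden)
        recon = self.decode(torch.cat([c, h], dim=-1))    # (B, T, emb_dim)
        # Per-word reconstruction error; after training, the inaccurate
        # word should have the largest error. Note: the paper reconstructs
        # each word from the *other* words; letting the encoders see the
        # whole sentence is a simplification in this sketch.
        err = (recon - e).pow(2).sum(dim=-1)              # (B, T)
        return err.argmax(dim=1), err

detector = InaccuracyDetector(vocab_size=10000)
sentence = torch.randint(0, 10000, (1, 12))  # one 12-word sentence
idx, errors = detector(sentence)
print("suspected inaccurate word position:", idx.item())
```

In the paper's second step, correction is cast as classification over the dictionary; in this sketch that would amount to a linear layer over the vocabulary applied at the flagged position, conditioned on the sentence (and, in the full model, the video) features.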

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1741431. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Center for Research in Computer Vision, University of Central Florida, Orlando, USA
