
Text-Image Alignment in Portuguese News Using LinkPICS

  • Wellington Cristiano Veltroni
  • Helena de Medeiros Caseli
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11122)

Abstract

Text-image alignment is the task of aligning elements in a text with elements in the image that accompanies it. In news articles, for example, it can improve clarity by explicitly defining the correspondence between regions in the article’s image and words or named entities in the article’s text. It is also a useful step in many multimodal applications such as image captioning or image description/comprehension. In this paper we present LinkPICS, an automatic aligner that combines Natural Language Processing (NLP) and Computer Vision (CV) techniques to explicitly define the correspondence between regions of an image (bounding boxes) and elements (words or named entities) in a text. LinkPICS performs the alignment of people and the alignment of objects (or animals, vehicles, etc.) as two distinct processes. In the experiments presented in this paper, LinkPICS obtained a precision of 97% in the alignment of people and 73% in the alignment of objects in Portuguese articles from a Brazilian news site.
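To make the two-process idea in the abstract concrete, the sketch below shows one plausible shape such an aligner could take: regions detected as people are matched against person named entities, while the remaining regions are matched against ordinary words using the detector's class label. This is a minimal illustration under stated assumptions; every data structure, function name and scoring heuristic here is hypothetical and is not the authors' implementation, which is described in the body of the paper.

```python
# Hypothetical sketch of a two-stage text-image aligner: people vs. other objects.
# All names, types and heuristics are illustrative assumptions, not LinkPICS itself.
from dataclasses import dataclass


@dataclass
class Region:
    box: tuple   # (x, y, w, h) bounding box in the article image
    label: str   # detector class label, e.g. "person", "dog", "car"


@dataclass
class TextElement:
    surface: str  # word or named entity as it appears in the article text
    kind: str     # "PERSON" for person named entities, "WORD" otherwise


def align(regions, elements, person_score, object_score, threshold=0.5):
    """Align image regions to text elements as two distinct processes:
    people are aligned to person names, everything else to ordinary words."""
    alignments = []
    for region in regions:
        if region.label == "person":
            candidates = [e for e in elements if e.kind == "PERSON"]
            score = person_score
        else:
            candidates = [e for e in elements if e.kind == "WORD"]
            score = object_score
        if not candidates:
            continue
        best = max(candidates, key=lambda e: score(region, e))
        if score(region, best) >= threshold:
            alignments.append((region, best))
    return alignments


# Toy scorers: a real system could compare a face crop against images retrieved
# for the candidate name, and compare the class label with word senses or embeddings.
person_score = lambda region, entity: 1.0                                   # placeholder
object_score = lambda region, word: float(region.label == word.surface.lower())

if __name__ == "__main__":
    regions = [Region((10, 10, 50, 120), "person"), Region((80, 40, 60, 40), "dog")]
    elements = [TextElement("Maria Silva", "PERSON"), TextElement("dog", "WORD")]
    for region, element in align(regions, elements, person_score, object_score):
        print(f"{region.label} {region.box} -> {element.surface}")
```

Splitting the problem this way mirrors the abstract's description: the person branch only needs to choose among person named entities, while the object branch only needs to relate a detector label to content words, so each branch can use evidence suited to it.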

Keywords

Text-image alignment · Aligner · LinkPICS · Brazilian Portuguese · Alignment of people · Alignment of objects

Acknowledgments

This research is part of the MMeaning project, supported by the São Paulo Research Foundation (FAPESP), grant #2016/13002-0, and was also partly funded by the Brazilian Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Universidade Federal de São Carlos (UFSCar), São Carlos, Brazil