
Entity linking across vision and language


Abstract

We propose a novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language. Given a video with subtitles, we jointly address two questions: a) What do the textual entity mentions refer to? and b) What/who appears in the video key frames? We use a Markov Random Field (MRF) to encode the dependencies within and across the two modalities. The MRF model incorporates beliefs obtained with independent methods for the textual and visual entities; these beliefs are propagated across the modalities to jointly derive the entity labels. We apply the framework to a challenging dataset of wildlife documentaries with subtitles and show that this integrated modeling yields significantly better performance than text-only and vision-only approaches. In particular, textual mentions that cannot be resolved by text-only methods are resolved correctly by our method. The approaches described here bring us closer to automated multimedia indexing.




Notes

  1. http://www.cisco.com/c/dam/en/us/solutions/collateral/service-provider/visual-networking-index-vni/complete-white-paper-c11-481360.pdf.

  2. We experiment with both gold mentions and mentions detected automatically using the method described in Section 5.

  3. We use the maximum of the probabilities rather than the mean or minimum because the most influential candidates are those that are close to the mention in question and have a high pair-wise score in the back-pointer probability matrix (see the sketch following these notes).

  4. It is possible that one mention refers to multiple animals, for example, ‘The zebra and giraffe peacefully co-exist in the savannah. They are both very valuable to the wildlife ...’, where the mention they refers to both the zebra and the giraffe. However, such cases do not occur in our dataset, and we ignore them for simplicity.

  5. This coreference resolution system is deterministic and does not provide the probabilities or link strengths among mentions that are essential for our method, which prevents us from using it from the outset.

  6. https://en.wikipedia.org/wiki/Great_Wildlife_Moments.

  7. Here, we take the initial set of node potentials (obtained using the back-pointer probabilities from the coreference resolver of Durrett and Klein [9]) and assign each mention to the name with the largest probability (see the sketch following these notes).

  8. We tested the statistical significance of the results using a mention-level paired t-test (see the sketch following these notes) and found that the LBP method was significantly better than Init (p < 0.01).

  9. Global inference over text and vision for the entire video was quite fast: on a 3.10 GHz Intel Xeon E5-2687W CPU, LBP took 0.65997 seconds, while VIT and Gibbs took 0.76278 and 0.67559 seconds, respectively.

  10. We tested the statistical significance of the results using a frame-level paired t-test and found that the LBP method was significantly better than Init in terms of both precision (p < 0.001) and recall (p = 0.0093).
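
The computations in notes 3 and 7 can be made concrete. Below is a minimal sketch, not the authors' implementation: it assumes the resolver's back-pointer probabilities are available as a matrix and that each mention is mapped to a candidate name id (or -1 if it carries none); all function names and the data layout are illustrative.

```python
import numpy as np

def node_potentials(backptr, name_of_mention):
    """Derive each mention's potential for every candidate name as the MAX
    over the back-pointer probabilities of mentions carrying that name
    (cf. note 3). Illustrative sketch, not the authors' code.

    backptr         : (n, n) matrix; backptr[i, j] is the probability that
                      mention i points back to mention j
    name_of_mention : length-n list of candidate name ids, -1 if unnamed
    """
    n = backptr.shape[0]
    n_names = max(name_of_mention) + 1
    potentials = np.zeros((n, n_names))
    for i in range(n):
        for j in range(n):
            name = name_of_mention[j]
            if name >= 0:
                # keep the strongest link to any mention carrying this name
                potentials[i, name] = max(potentials[i, name], backptr[i, j])
    return potentials

def init_assignment(potentials):
    """Initial assignment (cf. note 7): each mention gets the name with
    the largest initial potential, before any message passing."""
    return potentials.argmax(axis=1)
```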
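
The significance tests in notes 8 and 10 are standard paired t-tests over per-mention (or per-frame) scores. A minimal SciPy sketch with hypothetical scores:

```python
from scipy.stats import ttest_rel

# Hypothetical per-mention correctness scores (1 = correct, 0 = wrong)
# for the LBP method and the Init baseline on the same mentions.
lbp_scores  = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
init_scores = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

t_stat, p_value = ttest_rel(lbp_scores, init_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.01
```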

References

  1. Afkham HM, Targhi AT, Eklundh JO, Pronobis A (2008) Joint visual vocabulary for animal classification. In: 19th International conference on pattern recognition. IEEE, pp 1–4

  2. Alfonseca E, Manandhar S (2002) An unsupervised method for general named entity recognition and automated concept discovery. In: Proceedings of the 1st international conference on general WordNet. Mysore, pp 34–43

  3. Bagga A, Baldwin B (1998) Algorithms for scoring coreference chains. In: The first international conference on language resources and evaluation workshop on linguistics coreference, vol 1. Citeseer, pp 563–566

  4. Berg TL, Forsyth DA (2006) Animals on the web. In: 2006 IEEE Computer society conference on computer vision and pattern recognition, vol 2. IEEE, pp 1463–1470

  5. Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531

  6. Coates-Stephens S (1992) The analysis and acquisition of proper names for the understanding of free text. Comput Hum 26(5-6):441–456


  7. Coughlan JM, Ferreira SJ (2002) Finding deformable shapes using loopy belief propagation. In: European Conference on computer vision. Springer, pp 453–468

  8. Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: IEEE Conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, pp 248–255

  9. Durrett G, Klein D (2013) Easy victories and uphill battles in coreference resolution. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics. Seattle

  10. Durrett G, Klein D (2014) A joint model for entity analysis: coreference, typing, and linking. In: Transactions of the association for computational linguistics

  11. Dusart T, Nurani Venkitasubramanian A, Moens M F (2013) Cross-modal alignment for wildlife recognition. In: Proceedings of the 2nd ACM international workshop on multimedia analysis for ecological data. ACM, pp 9–14

  12. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2012) The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

  13. Fang H, Gupta S, Iandola F, Srivastava R, Deng L, Dollár P, Gao J, He X, Mitchell M, Platt J et al (2014) From captions to visual concepts and back. arXiv preprint arXiv:1411.4952

  14. Gomez A, Salazar A (2016) Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. arXiv preprint arXiv:1603.06169

  15. Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, Darrell T, Saenko K (2013) Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: 2013 IEEE International conference on computer vision (ICCV). IEEE, pp 2712–2719

  16. Guillaumin M, Mensink T, Verbeek J, Schmid C (2008) Automatic face naming with caption-based supervision. In: IEEE Conference on computer vision and pattern recognition, 2008. CVPR 2008. IEEE, pp 1–8

  17. Hariharan B, Girshick R (2016) Low-shot visual object recognition. arXiv preprint arXiv:1606.02819

  18. Hellier P, Demoulin V, Oisel L, Pérez P (2012) A contrario shot detection. In: 2012 19th IEEE international conference on image processing. IEEE, pp 3085–3088

  19. Joly A, Goëau H, Glotin H, Spampinato C, Bonnet P, Vellinga WP, Planqué R, Rauber A, Palazzo S, Fisher B et al (2015) Lifeclef 2015: multimedia life species identification challenges. In: International conference of the cross-language evaluation forum for European languages. Springer, pp 462–483

  20. Karpathy A, Fei-Fei L (2014) Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306

  21. Kazemzadeh S, Ordonez V, Matten M, Berg TL (2014) Referit game: referring to objects in photographs of natural scenes. In: EMNLP

  22. Khosla A, Jayadevaprakash N, Yao B, Li FFL (2011) Novel dataset for fine-grained image categorization. In: First workshop on fine-grained visual categorization, CVPR 2011. Citeseer

  23. Kong C, Lin D, Bansal M, Urtasun R, Fidler S (2014) What are you talking about? Text-to-image coreference. In: 2014 IEEE Conference on computer vision and pattern recognition. IEEE, pp 3558–3565

  24. Lee H, Chang A, Peirsman Y, Chambers N, Surdeanu M, Jurafsky D (2013) Deterministic coreference resolution based on entity-centric, precision-ranked rules. Comput Linguist 39(4):885–916


  25. Leser U, Hakenberg J (2005) What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinf 6(4):357–369


  26. Liu X, Li Y, Wu H, Zhou M, Wei F, Lu Y (2013) Entity linking for tweets. In: ACL, vol 1, pp 1304–1311

  27. Luo X (2005) On coreference resolution performance metrics. In: Proceedings of the conference on human language technology and empirical methods in natural language processing. Association for Computational Linguistics, pp 25–32

  28. Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38:39–41


  29. Nadeau D, Sekine S (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1):3–26


  30. Pearl J (2014) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann

  31. Pham PT, Moens MF, Tuytelaars T (2010) Cross-media alignment of names and faces. IEEE Trans Multimed 12(1):13–27


  32. Pham PT, Tuytelaars T, Moens MF (2011) Naming people in news videos with label propagation. IEEE Multimed 18(3):44–55


  33. Pradhan S, Luo X, Recasens M, Hovy E, Ng V, Strube M (2014) Scoring coreference partitions of predicted mentions: a reference implementation. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics

  34. Pradhan S, Ramshaw L, Marcus M, Palmer M, Weischedel R, Xue N (2011) Conll-2011 shared task: modeling unrestricted coreference in ontonotes. In: Proceedings of the Fifteenth conference on computational natural language learning: shared task. Association for Computational Linguistics, pp 1–27

  35. Ramanan D, Forsyth DA, Barnard K (2006) Building models of animals from video. IEEE Trans Pattern Anal Mach Intell 28(8):1319–1334


  36. Ramanathan V, Joulin A, Liang P, Fei-Fei L (2014) Linking people in videos with “their” names using coreference resolution. In: European conference on computer vision. Springer, pp 95–110

  37. Roth D, Yih WT (2002) Probabilistic reasoning for entity & relation recognition. In: Proceedings of the 19th international conference on computational linguistics, vol 1. Association for Computational Linguistics, pp 1–7

  38. Schmid C (2001) Constructing models for content-based image retrieval. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition, 2001. CVPR 2001, vol 2. IEEE, pp II–39

  39. Schmidt M (2012) UGM: Matlab code for undirected graphical models

  40. Seitner J, Bizer C, Eckert K, Faralli S, Meusel R, Paulheim H, Ponzetto S (2016) A large database of hypernymy relations extracted from the web. In: Proceedings of the 10th edition of the language resources and evaluation conference. Portoroz

  41. Shen W, Wang J, Han J (2015) Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Trans Knowl Data Eng 27(2):443–460


  42. Venkitasubramanian AN, Tuytelaars T, Moens MF (2016) Wildlife recognition in nature documentaries with weak supervision from subtitles and external data. Pattern Recogn Lett

  43. Vilain M, Burger J, Aberdeen J, Connolly D, Hirschman L (1995) A model-theoretic coreference scoring scheme. In: Proceedings of the 6th conference on message understanding. Association for Computational Linguistics, pp 45–52

  44. Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology


Author information

Correspondence to Aparna Nurani Venkitasubramanian.

Appendix: Metrics for evaluating the entity linking on text

We denote a set of mentions referring to the same entity as an entity cluster. Given a set of key (ground-truth) entity clusters K and a set of response (system-generated) entity clusters R, with each entity cluster comprising one or more mentions, each metric defines its own variant of precision and recall. The MUC measure is the oldest and most widely used. It focuses on the links (pairs of mentions) in the data: recall is the number of links common to K and R divided by the number of links in K, whereas precision is the number of common links divided by the number of links in R. This metric favours systems that put more mentions per cluster; a system that places all mentions in a single cluster obtains 100% recall without significant degradation in precision. It also ignores singleton clusters, i.e., entities with only one mention, since they contain no links.
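
As an illustration, here is a minimal sketch of the MUC computation, assuming the key and response clusterings cover the same set of mentions (a reading of Vilain et al. [43], not the official scorer):

```python
def muc(key, response):
    """Link-based MUC scorer. `key` and `response` are lists of sets of
    mentions; both clusterings are assumed to cover the same mentions."""
    def score(gold, system):
        num = den = 0
        for cluster in gold:
            # number of system clusters needed to partition this gold cluster
            parts = sum(1 for c in system if c & cluster)
            num += len(cluster) - parts  # links recovered by the system
            den += len(cluster) - 1      # links present in the gold cluster
        return num / den if den else 0.0
    recall = score(key, response)
    precision = score(response, key)
    return precision, recall

# Example: key = [{'m1', 'm2', 'm3'}], response = [{'m1', 'm2'}, {'m3'}]
# yields recall 0.5 (one of the two key links found) and precision 1.0.
```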

The B3 metric tries to address MUC's shortcomings by focusing on the mentions: it computes recall and precision scores for each mention. If K is the key entity cluster containing mention m, and R is the response entity cluster containing the same mention m, then the recall for mention m is computed as \(\frac {|\mathsf {K} \cap \mathsf {R}|}{|\mathsf {K}|}\) and the precision as \(\frac {|\mathsf {K} \cap \mathsf {R}|}{|\mathsf {R}|}\). Overall recall and precision are the averages of the individual mention scores.
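
Under the same coverage assumption, a minimal sketch of B3 (illustrative code, not the reference implementation [33]):

```python
def b_cubed(key, response):
    """B3 scorer: per-mention recall |K∩R|/|K| and precision |K∩R|/|R|,
    averaged over all mentions. Both clusterings (lists of sets) are
    assumed to cover the same mentions."""
    key_of = {m: c for c in key for m in c}        # mention -> key cluster
    resp_of = {m: c for c in response for m in c}  # mention -> response cluster
    mentions = list(key_of)
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m])
                 for m in mentions) / len(mentions)
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m])
                    for m in mentions) / len(mentions)
    return precision, recall
```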

CEAF aligns every response cluster with at most one key cluster by finding the best one-to-one mapping between the clusters under an entity similarity metric; this is a maximum bipartite matching problem, solved with the Kuhn-Munkres algorithm. The metric works at the level of entity clusters. Depending on the similarity used, there are two variations: a) entity-based CEAF (CEAF\(_e\)) and b) mention-based CEAF (CEAF\(_m\)). Recall is the total similarity of the best alignment normalised by the self-similarity of K (for the mention-based similarity, the number of mentions in K), and precision is the total similarity normalised by the self-similarity of R. In this work, we use CEAF\(_e\) for evaluation, in line with state-of-the-art coreference resolution and entity linking systems [9, 10, 24].
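
A minimal sketch of the CEAF alignment using SciPy's implementation of the Kuhn-Munkres algorithm; the similarity \(\phi\) is passed in as a function, with the entity-based choice \(\phi_4(\mathsf{K}, \mathsf{R}) = \frac{2|\mathsf{K} \cap \mathsf{R}|}{|\mathsf{K}| + |\mathsf{R}|}\) of Luo [27] shown below (illustrative, not the authors' evaluation code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf(key, response, phi):
    """Align key and response clusters one-to-one so that the total
    similarity under `phi` is maximised (Kuhn-Munkres), then normalise
    by the self-similarity of each side."""
    sim = np.array([[phi(k, r) for r in response] for k in key])
    rows, cols = linear_sum_assignment(-sim)  # negate to maximise
    total = sim[rows, cols].sum()
    recall = total / sum(phi(k, k) for k in key)
    precision = total / sum(phi(r, r) for r in response)
    return precision, recall

# Entity-based similarity for CEAF_e; phi(k, k) = 1, so the normalisers
# reduce to the numbers of key and response clusters.
phi_e = lambda a, b: 2 * len(a & b) / (len(a) + len(b))
```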


Cite this article

Venkitasubramanian, A.N., Tuytelaars, T. & Moens, MF. Entity linking across vision and language. Multimed Tools Appl 76, 22599–22622 (2017). https://doi.org/10.1007/s11042-017-4732-8
