Abstract
We propose a novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language. Given a video with subtitles, we jointly address two questions: a) What do the textual entity mentions refer to? and b) What/who appears in the video key frames? We use a Markov Random Field (MRF) to encode the dependencies within and across the two modalities. This MRF model incorporates beliefs obtained by independent methods for the textual and visual entities. These beliefs are propagated across the modalities to jointly derive the entity labels. We apply the framework to a challenging dataset of wildlife documentaries with subtitles and show that this integrated modeling yields significantly better performance than text-only and vision-only approaches. In particular, textual mentions that cannot be resolved by text-only methods are resolved correctly by our method. The approaches described here bring us closer to automated multimedia indexing.
Notes
We experiment with both gold mentions and mentions detected automatically using the method described in Section 5.
We use the maximum of the probabilities, rather than the mean or minimum, because the most influential candidates are those that are closest to the mention in question and have a high pair-wise score in the back-pointer probability matrix.
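This max-aggregation over candidate antecedents can be sketched as follows. The matrix values and the `candidate_score` helper are illustrative, not the paper's actual data or API:

```python
# Hypothetical back-pointer probabilities: backptr[i][j] is the probability
# that mention i points back to candidate antecedent j.
backptr = [
    [1.0, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.6, 0.2],
]

def candidate_score(mention, candidates):
    """Aggregate a mention's candidate antecedents by taking the maximum
    pair-wise back-pointer probability (rather than the mean or minimum)."""
    return max(backptr[mention][c] for c in candidates)

print(candidate_score(2, [0, 1]))  # -> 0.6
```

Taking the maximum lets a single strong, nearby antecedent dominate, whereas the mean would be dragged down by weak distant candidates.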
It is possible that one mention refers to multiple animals. For example, in ‘The zebra and giraffe peacefully co-exist in the Savannah. They are both very valuable to the wildlife ...’, the mention they refers to both zebra and giraffe. However, such cases do not occur in our dataset, so we ignore them for simplicity.
This coreference resolution system is deterministic and does not provide the probabilities or strengths among mentions that are essential for our method, which prevents us from using it from the outset.
Here, we look at the initial set of node potentials (obtained using the back-pointer probabilities from the coreference resolver of Durrett and Klein [9]) and assign each mention to the name with the largest probability.
We tested the statistical significance of the results using a mention-level paired t-test and found that the LBP method was significantly better than Init (p < 0.01).
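The paired t-test above compares the two systems mention by mention. A minimal sketch of the statistic, using made-up per-mention correctness indicators (not the paper's data):

```python
import math

# Hypothetical per-mention correctness (1 = resolved correctly, 0 = not),
# paired by mention, for the baseline (Init) and the LBP result.
init = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
lbp  = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]

# Paired t statistic: t = mean(d) / (sd(d) / sqrt(n)) over differences d.
d = [a - b for a, b in zip(lbp, init)]
n = len(d)
mean = sum(d) / n
var = sum((x - mean) ** 2 for x in d) / (n - 1)
t = mean / math.sqrt(var / n)
print(round(t, 2))  # -> 3.0
```

The p-value is then read from the t-distribution with n - 1 degrees of freedom (e.g. via `scipy.stats.ttest_rel` in practice).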
The global inference over text and vision for the entire video was quite fast. On a 3.10 GHz Intel Xeon E5-2687W processor, LBP took 0.66 seconds, while VIT and Gibbs took 0.76 and 0.68 seconds, respectively.
We tested the statistical significance of the results using a frame-level paired t-test and found that the LBP method was significantly better than Init both in terms of precision (p < 0.001) and recall (p = 0.0093).
References
Afkham HM, Targhi AT, Eklundh JO, Pronobis A (2008) Joint visual vocabulary for animal classification. In: 19th International conference on pattern recognition. IEEE, pp 1–4
Alfonseca E, Manandhar S (2002) An unsupervised method for general named entity recognition and automated concept discovery. In: Proceedings of the 1st international conference on general WordNet. Mysore, pp 34–43
Bagga A, Baldwin B (1998) Algorithms for scoring coreference chains. In: The first international conference on language resources and evaluation workshop on linguistic coreference, vol 1. Citeseer, pp 563–566
Berg TL, Forsyth DA (2006) Animals on the web. In: 2006 IEEE Computer society conference on computer vision and pattern recognition, vol 2. IEEE, pp 1463–1470
Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531
Coates-Stephens S (1992) The analysis and acquisition of proper names for the understanding of free text. Comput Hum 26(5-6):441–456
Coughlan JM, Ferreira SJ (2002) Finding deformable shapes using loopy belief propagation. In: European Conference on computer vision. Springer, pp 453–468
Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: IEEE Conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, pp 248–255
Durrett G, Klein D (2013) Easy victories and uphill battles in coreference resolution. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics. Seattle
Durrett G, Klein D (2014) A joint model for entity analysis: coreference, typing, and linking. In: Transactions of the association for computational linguistics
Dusart T, Nurani Venkitasubramanian A, Moens M F (2013) Cross-modal alignment for wildlife recognition. In: Proceedings of the 2nd ACM international workshop on multimedia analysis for ecological data. ACM, pp 9–14
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
Fang H, Gupta S, Iandola F, Srivastava R, Deng L, Dollár P., Gao J, He X, Mitchell M, Platt J et al (2014) From captions to visual concepts and back. arXiv preprint arXiv:1411.4952
Gomez A, Salazar A (2016) Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. arXiv preprint arXiv:1603.06169
Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, Darrell T, Saenko K (2013) Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: 2013 IEEE International conference on computer vision (ICCV). IEEE, pp 2712–2719
Guillaumin M, Mensink T, Verbeek J, Schmid C (2008) Automatic face naming with caption-based supervision. In: IEEE Conference on computer vision and pattern recognition, 2008. CVPR 2008. IEEE, pp 1–8
Hariharan B, Girshick R (2016) Low-shot visual object recognition. arXiv preprint arXiv:1606.02819
Hellier P, Demoulin V, Oisel L, Pérez P. (2012) A contrario shot detection. In: 2012 19th IEEE international conference on image processing. IEEE, pp 3085–3088
Joly A, Goëau H, Glotin H, Spampinato C, Bonnet P, Vellinga WP, Planqué R, Rauber A, Palazzo S, Fisher B et al (2015) Lifeclef 2015: multimedia life species identification challenges. In: International conference of the cross-language evaluation forum for European languages. Springer, pp 462–483
Karpathy A, Fei-Fei L (2014) Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306
Kazemzadeh S, Ordonez V, Matten M, Berg TL (2014) Referit game: referring to objects in photographs of natural scenes. In: EMNLP
Khosla A, Jayadevaprakash N, Yao B, Li FFL (2011) Novel dataset for fine-grained image categorization. In: First workshop on fine-grained visual categorization, CVPR (2011). Citeseer
Kong C, Lin D, Bansal M, Urtasun R, Fidler S (2014) What are you talking about? Text-to-image coreference. In: 2014 IEEE Conference on computer vision and pattern recognition. IEEE, pp 3558–3565
Lee H, Chang A, Peirsman Y, Chambers N, Surdeanu M, Jurafsky D (2013) Deterministic coreference resolution based on entity-centric, precision-ranked rules. Comput Linguist 39(4):885–916
Leser U, Hakenberg J (2005) What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinf 6(4):357–369
Liu X, Li Y, Wu H, Zhou M, Wei F, Lu Y (2013) Entity linking for tweets. In: ACL, vol 1, pp 1304–1311
Luo X (2005) On coreference resolution performance metrics. In: Proceedings of the conference on human language technology and empirical methods in natural language processing. Association for Computational Linguistics, pp 25–32
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38:39–41
Nadeau D, Sekine S (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1):3–26
Pearl J (2014) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann
Pham PT, Moens MF, Tuytelaars T (2010) Cross-media alignment of names and faces. IEEE Trans Multimed 12(1):13–27
Pham PT, Tuytelaars T, Moens MF (2011) Naming people in news videos with label propagation. IEEE Multimed 18(3):44–55
Pradhan S, Luo X, Recasens M, Hovy E, Ng V, Strube M (2014) Scoring coreference partitions of predicted mentions: a reference implementation. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics
Pradhan S, Ramshaw L, Marcus M, Palmer M, Weischedel R, Xue N (2011) CoNLL-2011 shared task: modeling unrestricted coreference in OntoNotes. In: Proceedings of the fifteenth conference on computational natural language learning: shared task. Association for Computational Linguistics, pp 1–27
Ramanan D, Forsyth DA, Barnard K (2006) Building models of animals from video. IEEE Trans Pattern Anal Mach Intell 28(8):1319–1334
Ramanathan V, Joulin A, Liang P, Fei-Fei L (2014) Linking people in videos with “their” names using coreference resolution. In: European conference on computer vision. Springer, pp 95–110
Roth D, Yih WT (2002) Probabilistic reasoning for entity & relation recognition. In: Proceedings of the 19th international conference on computational linguistics, vol 1. Association for Computational Linguistics, pp 1–7
Schmid C (2001) Constructing models for content-based image retrieval. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition, 2001. CVPR 2001, vol 2. IEEE, pp II–39
Schmidt M (2012) UGM: Matlab code for undirected graphical models
Seitner J, Bizer C, Eckert K, Faralli S, Meusel R, Paulheim H, Ponzetto S (2016) A large database of hypernymy relations extracted from the web. In: Proceedings of the 10th edition of the language resources and evaluation conference. Portoroz
Shen W, Wang J, Han J (2015) Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Trans Knowl Data Eng 27(2):443–460
Venkitasubramanian AN, Tuytelaars T, Moens MF (2016) Wildlife recognition in nature documentaries with weak supervision from subtitles and external data. Pattern Recogn Lett
Vilain M, Burger J, Aberdeen J, Connolly D, Hirschman L (1995) A model-theoretic coreference scoring scheme. In: Proceedings of the 6th conference on message understanding. Association for Computational Linguistics, pp 45–52
Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The Caltech-UCSD birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology
Appendix: Metrics for evaluating the entity linking on text
We denote a set of mentions referring to the same entity as an entity cluster. Given a set of key (ground-truth) entity clusters K and a set of response (system-generated) entity clusters R, with each entity cluster comprising one or more mentions, each metric generates its own variant of precision and recall. The MUC measure is the oldest and most widely used. It focuses on the links (pairs of mentions) in the data. Recall is the number of links common to K and R divided by the number of links in K, whereas precision is the number of common links divided by the number of links in R. This metric favors systems that put more mentions per cluster: a system that creates a single cluster of all the mentions gets 100% recall without significant degradation in precision. It also ignores recall for singleton clusters, i.e., entities with only one mention.
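The link-based computation described above can be sketched as follows (with toy clusterings; the official CoNLL scorer additionally handles some edge cases via partition counting):

```python
from itertools import combinations

def links(clusters):
    """All coreference links (unordered mention pairs) implied by a clustering."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def muc(key, response):
    """MUC precision/recall: common links over links in R and K respectively."""
    lk, lr = links(key), links(response)
    common = lk & lr
    recall = len(common) / len(lk) if lk else 0.0
    precision = len(common) / len(lr) if lr else 0.0
    return precision, recall

key = [{"a", "b", "c"}, {"d"}]           # key links: ab, ac, bc
resp = [{"a", "b"}, {"c"}, {"d"}]        # response links: ab
print(muc(key, resp))                    # -> (1.0, 0.333...)
```

Note that the singleton {"d"} contributes no links at all, illustrating why MUC ignores singleton clusters.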
The B3 metric tries to address MUC’s shortcomings by focusing on the mentions: it computes recall and precision scores for each mention. If K is the key entity cluster containing mention m, and R is the response entity cluster containing mention m, then recall for mention m is computed as \(\frac {|\mathsf {K} \cap \mathsf {R}|}{|\mathsf {K}|}\) and precision as \(\frac {|\mathsf {K} \cap \mathsf {R}|}{|\mathsf {R}|}\). Overall recall and precision are the averages of the individual mention scores.
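The per-mention averaging can be sketched as follows, again on toy clusterings:

```python
def b_cubed(key, response):
    """B^3: per-mention precision/recall, averaged over all mentions."""
    def cluster_of(m, clusters):
        return next(c for c in clusters if m in c)
    mentions = {m for c in key for m in c}
    prec = rec = 0.0
    for m in mentions:
        K = cluster_of(m, key)       # key cluster containing m
        R = cluster_of(m, response)  # response cluster containing m
        inter = len(K & R)
        rec += inter / len(K)
        prec += inter / len(R)
    n = len(mentions)
    return prec / n, rec / n

key = [{"a", "b", "c"}, {"d"}]
resp = [{"a", "b"}, {"c", "d"}]
print(b_cubed(key, resp))  # -> (0.75, 0.666...)
```

Unlike MUC, every mention, including singletons, contributes to both scores.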
CEAF aligns every response cluster with at most one key cluster by finding the best one-to-one mapping between the clusters under an entity similarity metric. This is a maximum bipartite matching problem, solved with the Kuhn-Munkres algorithm. The metric works at the level of the entity cluster. Depending on the similarity metric, there are two variants: a) entity-based CEAF (CEAF\(_{e}\)) and b) mention-based CEAF (CEAF\(_{m}\)). Recall is the total similarity divided by the number of mentions in K, and precision is the total similarity divided by the number of mentions in R. In this work, we use CEAF\(_{e}\) for evaluation, in line with the state-of-the-art coreference resolution and entity linking systems [9, 10, 24].
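The mention-based variant can be sketched as follows. For clarity this toy version brute-forces the optimal one-to-one mapping over permutations; in practice the Kuhn-Munkres algorithm (e.g. `scipy.optimize.linear_sum_assignment`) finds it efficiently:

```python
from itertools import permutations

def ceaf_m(key, response):
    """Mention-based CEAF: best one-to-one cluster alignment maximizing the
    total number of shared mentions (brute force, for illustration only)."""
    k, r = list(key), list(response)
    while len(r) < len(k):          # pad so every key cluster can be matched
        r.append(set())
    best = max(sum(len(ki & ri) for ki, ri in zip(k, perm))
               for perm in permutations(r, len(k)))
    recall = best / sum(len(c) for c in key)
    precision = best / sum(len(c) for c in response)
    return precision, recall

key = [{"a", "b", "c"}, {"d"}]
resp = [{"a", "b"}, {"c", "d"}]
print(ceaf_m(key, resp))  # -> (0.75, 0.75)
```

CEAF\(_{e}\) works the same way but scores each aligned pair with a normalized entity similarity instead of the raw mention overlap.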
Cite this article
Venkitasubramanian, A.N., Tuytelaars, T. & Moens, MF. Entity linking across vision and language. Multimed Tools Appl 76, 22599–22622 (2017). https://doi.org/10.1007/s11042-017-4732-8