Abstract
Humans have a remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can’t help but draw probable inferences beyond the literal scene, based on our everyday experience and knowledge about the world. For example, if we see a “20 mph” sign alongside a road, we might assume the street sits in a residential area (rather than on a highway), even if no houses are pictured. Can machines perform similar visual reasoning?
We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents. We adopt a free-viewing paradigm: participants first observe and identify salient clues within images (e.g., objects, actions) and then provide a plausible inference about the scene, given the clue. In total, we collect 363K (clue, inference) pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using our corpus, we test three complementary axes of abductive reasoning. We evaluate the capacity of models to: i) retrieve relevant inferences from a large candidate corpus; ii) localize evidence for inferences via bounding boxes; and iii) compare plausible inferences to match human judgments on a newly-collected diagnostic corpus of 19K Likert-scale judgments. While we find that fine-tuning CLIP RN50\(\times\)64 with a multitask objective outperforms strong baselines, significant headroom exists between model performance and human agreement. Data, models, and a leaderboard are available at http://visualabduction.com/.
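As a rough illustration of the retrieval task's mechanics (not the authors' released model or training code), the sketch below ranks candidate inferences against an image with an off-the-shelf CLIP model. It assumes the openai/CLIP Python package is installed; "RN50x64" matches the backbone named above, but the zero-shot usage and whole-image scoring are simplifications of the paper's fine-tuned, multitask, clue-conditioned setup.

```python
# Minimal sketch of clue-to-inference retrieval with off-the-shelf CLIP.
# Assumptions: the openai/CLIP package is installed; "RN50x64" names the
# backbone from the paper; prompt handling and ranking are illustrative only.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x64", device=device)

def rank_inferences(image_path, candidate_inferences):
    """Return candidate inferences sorted by CLIP cosine similarity to the image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(candidate_inferences, truncate=True).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)  # cosine similarities, shape (N,)
    order = sims.argsort(descending=True).tolist()
    return [(candidate_inferences[i], sims[i].item()) for i in order]
```

In the paper's setting the candidate pool is large (roughly 1K candidates per retrieval split; see note 8 below), so in practice the text features would typically be pre-computed and cached rather than re-encoded per image.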
You know my method.
It is founded upon the observation of trifles.
“The Boscombe Valley Mystery”, by A. C. Doyle
Notes
- 1.
- 2. The correctness of abductive reasoning is certainly not guaranteed. Our goal is to study perception and reasoning without endorsing specific inferences (see Sect. 3.1).
- 3.
- 4. We reserve generative evaluations (e.g., BLEU/CIDEr) for future work: shortcuts (e.g., outputting the technically correct “this is a photo” for all inputs) make generation evaluation difficult in the abductive setting (see Sect. 6). Nonetheless, generative models can be evaluated in our setup; we experiment with one in Sect. 5.1.
- 5. https://www.perspectiveapi.com/; November 2021 version. The API (which is itself imperfect and has biases [18, 38, 55]) assigns a toxicity value of 0–1 to a given input text; a sketch of such a request appears after these notes. Toxicity is defined as “a rude, disrespectful, or unreasonable comment that is likely to make one leave a discussion”.
- 6. As discussed in Sect. 3, N has a mean/median of 1.17/1.0 across the corpus.
- 7. In §B.1, for completeness, we give results on the retrieval and localization setups, but testing on clues instead of inferences.
- 8. Our validation/test sets contain about 23K inferences. For efficiency, we randomly split them into 23 equal-sized chunks of about 1K inferences each and report retrieval averaged over the resulting splits (a sketch of this procedure appears after these notes).
- 9. Since annotators were able to specify multiple bounding boxes per observation pair, we count a match to any of the labeled bounding boxes (see the IoU sketch after these notes).
- 10. A small number of images do not have a ResNeXt bounding box with IoU \(>0.5\) with any ground-truth bounding box: in Sect. 5.1, we show that most instances (96.2%) are solvable with this setup.
- 11.
- 12. In Sect. 5.1, we show that models achieve significantly lower correlation compared to human agreement.
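The following is a hedged sketch of querying the Perspective API for a toxicity score (note 5), based on its publicly documented REST interface. It is not the authors' filtering code; the API key is a placeholder, and field names may differ from the November 2021 version they used.

```python
# Hedged sketch of a Perspective API toxicity query (note 5), following the
# publicly documented REST interface; not the authors' filtering pipeline.
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity(text: str) -> float:
    """Return the Perspective TOXICITY summary score in [0, 1] for `text`."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```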
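Next, an illustrative sketch (not the authors' evaluation code) of the chunked retrieval evaluation from note 8: candidates are shuffled, split into 23 roughly 1K-sized chunks, and a retrieval statistic is computed within each chunk and averaged. The `sim_fn` interface, the use of mean rank as the metric, and the assumption that each image's gold inference shares its index are all illustrative assumptions.

```python
# Illustrative sketch of the chunked retrieval evaluation described in note 8.
# Assumptions: image_ids and inference_ids are parallel lists (index i's image
# pairs with index i's gold inference); mean rank is used as an example metric.
import numpy as np

def mean_rank_over_chunks(sim_fn, image_ids, inference_ids, n_chunks=23, seed=0):
    """sim_fn(image_id, candidate_texts) -> similarity scores (higher = better)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(inference_ids))
    ranks = []
    for chunk in np.array_split(perm, n_chunks):
        chunk_texts = [inference_ids[j] for j in chunk]
        for pos, idx in enumerate(chunk):
            scores = np.asarray(sim_fn(image_ids[idx], chunk_texts))
            # rank of the gold inference among this chunk's candidates (1 = best)
            ranks.append(int((scores > scores[pos]).sum()) + 1)
    return float(np.mean(ranks))
```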
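Finally, a minimal sketch of the localization matching rule from notes 9–10, under the assumption that boxes are given as (x1, y1, x2, y2) corner coordinates: a predicted box counts as correct if its IoU with any annotated gold box exceeds 0.5.

```python
# Minimal sketch of the any-gold-box matching rule from notes 9-10.
# Assumption: boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def matches_any_gold(pred_box, gold_boxes, thresh=0.5):
    """A prediction is correct if it overlaps any labeled gold box above thresh."""
    return any(iou(pred_box, g) > thresh for g in gold_boxes)
```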
References
Aliseda, A.: The logic of abduction: an introduction. In: Magnani, L., Bertolotti, T. (eds.) Springer Handbook of Model-Based Science. SH, pp. 219–230. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-30526-4_10
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)
Bender, E.M., Friedman, B.: Data statements for natural language processing: toward mitigating system bias and enabling better science. TACL 6, 587–604 (2018)
Berg, A.C., et al.: Understanding and predicting importance in images. In: CVPR (2012)
Bhagavatula, C., et al.: Abductive commonsense reasoning. In: ICLR (2020)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. JMLR 3, 993–1022 (2003)
Carson, D.: The abduction of Sherlock Holmes. Int. J. Police Sci. Manage. 11(2), 193–202 (2009)
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Yu., Liu, J.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Dosovitskiy, A., et al.: An image is worth 16 \(\times \) 16 words: transformers for image recognition at scale. In: ICLR (2021)
Du, L., Ding, X., Liu, T., Qin, B.: Learning event graph knowledge for abductive reasoning. In: ACL (2021)
Fang, Z., Gokhale, T., Banerjee, P., Baral, C., Yang, Y.: Video2Commonsense: generating commonsense descriptions to enrich video captioning. In: EMNLP (2020)
Garcia, N., Otani, M., Chu, C., Nakashima, Y.: KnowIT VQA: answering knowledge-based questions about videos. In: AAAI (2020)
Gebru, T., et al.: Datasheets for datasets. Commun. ACM 64(12), 86–92 (2021)
Grice, H.P.: Logic and conversation. In: Speech Acts, pp. 41–58. Brill (1975)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hobbs, J.R., Stickel, M.E., Appelt, D.E., Martin, P.: Interpretation as abduction. Artif. Intell. 63(1–2), 69–142 (1993)
Hosseini, H., Kannan, S., Zhang, B., Poovendran, R.: Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138 (2017)
Ignat, O., Castro, S., Miao, H., Li, W., Mihalcea, R.: WhyAct: identifying action reasons in lifestyle vlogs. In: EMNLP (2021)
Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: TGIF-QA: toward spatio-temporal reasoning in visual question answering. In: CVPR (2017)
Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: CVPR (2016)
Johnson, J., et al.: Image retrieval using scene graphs. In: CVPR (2015)
Jonker, R., Volgenant, A.: A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38(4), 325–340 (1987). https://doi.org/10.1007/BF02278710
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: EMNLP (2014)
Kim, H., Zala, A., Bansal, M.: CoSIm: commonsense reasoning for counterfactual scene imagination. In: NAACL (2022)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Krahmer, E., Van Deemter, K.: Computational generation of referring expressions: a survey. Comput. Linguist. 38(1), 173–218 (2012)
Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV (2016). https://doi.org/10.1007/S11263-016-0981-7
Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2(1–2), 83–97 (1955)
Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVQA+: spatio-temporal grounding for video question answering. In: ACL (2020)
Lei, J., Yu, L., Berg, T.L., Bansal, M.: What is more likely to happen next? video-and-language future event prediction. In: EMNLP (2020)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)
Liu, J., et al.: Violin: a large-scale dataset for video-and-language inference. In: CVPR (2020)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: a visual question answering benchmark requiring external knowledge. In: CVPR (2019)
Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: visual question answering by reading text in images. In: ICDAR (2019)
Mitchell, M., et al.: Model cards for model reporting. In: FAccT (2019)
Niiniluoto, I.: Defending abduction. Philos. Sci. 66, S436–S451 (1999)
Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Ordonez, V., Kulkarni, G., Berg, T.L.: Im2Text: describing images using 1 million captioned photographs. In: NeurIPS (2011)
Ovchinnikova, E., Montazeri, N., Alexandrov, T., Hobbs, J.R., McCord, M.C., Mulkar-Mehta, R.: Abductive reasoning with a large knowledge base for discourse processing. In: IWCS (2011)
Park, D.H., Darrell, T., Rohrbach, A.: Robust change captioning. In: ICCV (2019)
Park, J.S., Bhagavatula, C., Mottaghi, R., Farhadi, A., Choi, Y.: VisualCOMET: reasoning about the dynamic context of a still image. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 508–524. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_30
Paul, D., Frank, A.: Generating hypothetical events for abductive inference. In: *SEM (2021)
Peirce, C.S.: Philosophical Writings of Peirce, vol. 217. Courier Corporation (1955)
Peirce, C.S.: Pragmatism and Pragmaticism, vol. 5. Belknap Press of Harvard University Press (1965)
Pezzelle, S., Greco, C., Gandolfi, G., Gualdoni, E., Bernardi, R.: Be different to be better! a benchmark to leverage the complementarity of language and vision. In: Findings of EMNLP (2020)
Pirsiavash, H., Vondrick, C., Torralba, A.: Inferring the why in images. Technical report (2014)
Qin, L., et al.: Back to the future: unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning. In: EMNLP (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: EMNLP (2019)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
Sap, M., Card, D., Gabriel, S., Choi, Y., Smith, N.A.: The risk of racial bias in hate speech detection. In: ACL (2019)
Shank, G.: The extraordinary ordinary powers of abductive reasoning. Theor. Psychol. 8(6), 841–860 (1998)
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
Shazeer, N., Stern, M.: Adafactor: adaptive learning rates with sublinear memory cost. In: ICML (2018)
Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: NeurIPS (2016)
Tafjord, O., Mishra, B.D., Clark, P.: ProofWriter: generating implications, proofs, and abductive statements over natural language. In: Findings of ACL (2021)
Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: EMNLP (2019)
Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., Fidler, S.: MovieQA: understanding stories in movies through question-answering. In: CVPR (2016)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Vedantam, R., Lin, X., Batra, T., Zitnick, C.L., Parikh, D.: Learning common sense through visual abstraction. In: ICCV (2015)
Wang, P., Wu, Q., Shen, C., Dick, A., Van Den Hengel, A.: FVQA: fact-based visual question answering. TPAMI 40(10), 2413–2427 (2017)
Wang, P., Wu, Q., Shen, C., Hengel, A.V.D., Dick, A.: Explicit knowledge-based reasoning for visual question answering. In: IJCAI (2017)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: EMNLP: System Demonstrations (2020)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T.S., Sun, M.: CPT: colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797 (2021)
Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. In: ICLR (2020)
Yu, L., Park, E., Berg, A.C., Berg, T.L.: Visual Madlibs: fill in the blank image generation and question answering. In: ICCV (2015)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_5
Zadeh, A., Chan, M., Liang, P.P., Tong, E., Morency, L.P.: Social-IQ: a question answering benchmark for artificial social intelligence. In: CVPR (2019)
Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: visual commonsense reasoning. In: CVPR (2019)
Zellers, R., et al.: MERLOT: multimodal neural script knowledge models. In: NeurIPS (2021)
Zhang, C., Gao, F., Jia, B., Zhu, Y., Zhu, S.C.: RAVEN: a dataset for relational and analogical visual reasoning. In: CVPR (2019)
Zhang, H., Huo, Y., Zhao, X., Song, Y., Roth, D.: Learning contextual causality from time-consecutive images. In: CVPR Workshops (2021)
Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7W: grounded question answering in images. In: CVPR (2016)
Acknowledgments
This work was funded by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the DARPA SemaFor program, and the Allen Institute for AI. AR was additionally supported in part by the DARPA PTG program, as well as BAIR's industrial alliance program. We also thank the UC Berkeley SemaFor group for helpful discussions and feedback.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hessel, J. et al. (2022). The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_32
DOI: https://doi.org/10.1007/978-3-031-20059-5_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20058-8
Online ISBN: 978-3-031-20059-5
eBook Packages: Computer Science; Computer Science (R0)