Abstract
Humans have a remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can’t help but draw probable inferences beyond the literal scene, based on our everyday experience and knowledge about the world. For example, if we see a “20 mph” sign alongside a road, we might assume the street sits in a residential area (rather than on a highway), even if no houses are pictured. Can machines perform similar visual reasoning?
We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents. We adopt a free-viewing paradigm: participants first observe and identify salient clues within images (e.g., objects, actions) and then provide a plausible inference about the scene, given the clue. In total, we collect 363K (clue, inference) pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using our corpus, we test three complementary axes of abductive reasoning. We evaluate the capacity of models to: i) retrieve relevant inferences from a large candidate corpus; ii) localize evidence for inferences via bounding boxes; and iii) compare plausible inferences to match human judgments on a newly-collected diagnostic corpus of 19K Likert-scale judgments. While we find that fine-tuning CLIP RN50\(\times\)64 with a multitask objective outperforms strong baselines, significant headroom exists between model performance and human agreement. Data, models, and a leaderboard are available at http://visualabduction.com/.
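As a rough illustration of the retrieval task's mechanics (not the authors' released model or training code), the sketch below ranks candidate inferences against an image with an off-the-shelf CLIP model. It assumes the openai/CLIP Python package is installed; "RN50x64" matches the backbone named above, but the zero-shot usage and whole-image scoring are simplifications of the paper's fine-tuned, multitask, clue-conditioned setup.

```python
# Minimal sketch of clue-to-inference retrieval with off-the-shelf CLIP.
# Assumptions: the openai/CLIP package is installed; "RN50x64" names the
# backbone from the paper; prompt handling and ranking are illustrative only.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x64", device=device)

def rank_inferences(image_path, candidate_inferences):
    """Return candidate inferences sorted by CLIP cosine similarity to the image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(candidate_inferences, truncate=True).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)  # cosine similarities, shape (N,)
    order = sims.argsort(descending=True).tolist()
    return [(candidate_inferences[i], sims[i].item()) for i in order]
```

In the paper's setting the candidate pool is large (roughly 1K candidates per retrieval split; see note 8 below), so in practice the text features would typically be pre-computed and cached rather than re-encoded per image.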
You know my method.
It is founded upon the observation of trifles.
“The Boscombe Valley Mystery”, by A. C. Doyle
Notes
- 1.
- 2. The correctness of abductive reasoning is certainly not guaranteed. Our goal is to study perception and reasoning without endorsing specific inferences (see Sect. 3.1).
- 3.
- 4. We reserve generative evaluations (e.g., BLEU/CIDEr) for future work: shortcuts (e.g., outputting the technically correct “this is a photo” for all inputs) make generation evaluation difficult in the abductive setting (see Sect. 6). Nonetheless, generative models can be evaluated in our setup; we experiment with one in Sect. 5.1.
- 5. https://www.perspectiveapi.com/; November 2021 version. The API (which is itself imperfect and has biases [18, 38, 55]) assigns a toxicity value of 0–1 to a given input text; a sketch of such a request appears after these notes. Toxicity is defined as “a rude, disrespectful, or unreasonable comment that is likely to make one leave a discussion”.
- 6. As discussed in Sect. 3, N has a mean/median of 1.17/1.0 across the corpus.
- 7. In §B.1, for completeness, we give results on the retrieval and localization setups, but testing on clues instead of inferences.
- 8. Our validation/test sets contain about 23K inferences. For efficiency, we randomly split them into 23 equal-sized chunks of about 1K inferences each and report retrieval averaged over the resulting splits (a sketch of this procedure appears after these notes).
- 9. Since annotators were able to specify multiple bounding boxes per observation pair, we count a match to any of the labeled bounding boxes (see the IoU sketch after these notes).
- 10. A small number of images do not have a ResNeXt bounding box with IoU \(>0.5\) with any ground-truth bounding box: in Sect. 5.1, we show that most instances (96.2%) are solvable with this setup.
- 11.
- 12. In Sect. 5.1, we show that models achieve significantly lower correlation compared to human agreement.
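The following is a hedged sketch of querying the Perspective API for a toxicity score (note 5), based on its publicly documented REST interface. It is not the authors' filtering code; the API key is a placeholder, and field names may differ from the November 2021 version they used.

```python
# Hedged sketch of a Perspective API toxicity query (note 5), following the
# publicly documented REST interface; not the authors' filtering pipeline.
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity(text: str) -> float:
    """Return the Perspective TOXICITY summary score in [0, 1] for `text`."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```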
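Next, an illustrative sketch (not the authors' evaluation code) of the chunked retrieval evaluation from note 8: candidates are shuffled, split into 23 roughly 1K-sized chunks, and a retrieval statistic is computed within each chunk and averaged. The `sim_fn` interface, the use of mean rank as the metric, and the assumption that each image's gold inference shares its index are all illustrative assumptions.

```python
# Illustrative sketch of the chunked retrieval evaluation described in note 8.
# Assumptions: image_ids and inference_ids are parallel lists (index i's image
# pairs with index i's gold inference); mean rank is used as an example metric.
import numpy as np

def mean_rank_over_chunks(sim_fn, image_ids, inference_ids, n_chunks=23, seed=0):
    """sim_fn(image_id, candidate_texts) -> similarity scores (higher = better)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(inference_ids))
    ranks = []
    for chunk in np.array_split(perm, n_chunks):
        chunk_texts = [inference_ids[j] for j in chunk]
        for pos, idx in enumerate(chunk):
            scores = np.asarray(sim_fn(image_ids[idx], chunk_texts))
            # rank of the gold inference among this chunk's candidates (1 = best)
            ranks.append(int((scores > scores[pos]).sum()) + 1)
    return float(np.mean(ranks))
```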
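Finally, a minimal sketch of the localization matching rule from notes 9–10, under the assumption that boxes are given as (x1, y1, x2, y2) corner coordinates: a predicted box counts as correct if its IoU with any annotated gold box exceeds 0.5.

```python
# Minimal sketch of the any-gold-box matching rule from notes 9-10.
# Assumption: boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def matches_any_gold(pred_box, gold_boxes, thresh=0.5):
    """A prediction is correct if it overlaps any labeled gold box above thresh."""
    return any(iou(pred_box, g) > thresh for g in gold_boxes)
```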
References
Aliseda, A.: The logic of abduction: an introduction. In: Magnani, L., Bertolotti, T. (eds.) Springer Handbook of Model-Based Science. SH, pp. 219–230. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-30526-4_10
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)
Bender, E.M., Friedman, B.: Data statements for natural language processing: toward mitigating system bias and enabling better science. TACL 6, 587–604 (2018)
Berg, A.C., et al.: Understanding and predicting importance in images. In: CVPR (2012)
Bhagavatula, C., et al.: Abductive commonsense reasoning. In: ICLR (2020)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. JMLR 3, 993–1022 (2003)
Carson, D.: The abduction of Sherlock Holmes. Int. J. Police Sci. Manage. 11(2), 193–202 (2009)
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Yu., Liu, J.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Dosovitskiy, A., et al.: An image is worth 16 \(\times \) 16 words: transformers for image recognition at scale. In: ICLR (2021)
Du, L., Ding, X., Liu, T., Qin, B.: Learning event graph knowledge for abductive reasoning. In: ACL (2021)
Fang, Z., Gokhale, T., Banerjee, P., Baral, C., Yang, Y.: Video2Commonsense: generating commonsense descriptions to enrich video captioning. In: EMNLP (2020)
Garcia, N., Otani, M., Chu, C., Nakashima, Y.: KnowIT VQA: answering knowledge-based questions about videos. In: AAAI (2020)
Gebru, T., et al.: Datasheets for datasets. Commun. ACM 64(12), 86–92 (2021)
Grice, H.P.: Logic and conversation. In: Speech Acts, pp. 41–58. Brill (1975)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hobbs, J.R., Stickel, M.E., Appelt, D.E., Martin, P.: Interpretation as abduction. Artif. Intell. 63(1–2), 69–142 (1993)
Hosseini, H., Kannan, S., Zhang, B., Poovendran, R.: Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138 (2017)
Ignat, O., Castro, S., Miao, H., Li, W., Mihalcea, R.: WhyAct: identifying action reasons in lifestyle vlogs. In: EMNLP (2021)
Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: TGIF-QA: toward spatio-temporal reasoning in visual question answering. In: CVPR (2017)
Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: CVPR (2016)
Johnson, J., et al.: Image retrieval using scene graphs. In: CVPR (2015)
Jonker, R., Volgenant, A.: A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38(4), 325–340 (1987). https://doi.org/10.1007/BF02278710
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: EMNLP (2014)
Kim, H., Zala, A., Bansal, M.: CoSIm: commonsense reasoning for counterfactual scene imagination. In: NAACL (2022)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Krahmer, E., Van Deemter, K.: Computational generation of referring expressions: a survey. Comput. Linguist. 38(1), 173–218 (2012)
Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV (2016). https://doi.org/10.1007/S11263-016-0981-7
Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2(1–2), 83–97 (1955)
Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVQA+: spatio-temporal grounding for video question answering. In: ACL (2020)
Lei, J., Yu, L., Berg, T.L., Bansal, M.: What is more likely to happen next? video-and-language future event prediction. In: EMNLP (2020)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)
Liu, J., et al.: Violin: a large-scale dataset for video-and-language inference. In: CVPR (2020)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: a visual question answering benchmark requiring external knowledge. In: CVPR (2019)
Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: visual question answering by reading text in images. In: ICDAR (2019)
Mitchell, M., et al.: Model cards for model reporting. In: FAccT (2019)
Niiniluoto, I.: Defending abduction. Philos. Sci. 66, S436–S451 (1999)
Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Ordonez, V., Kulkarni, G., Berg, T.L.: Im2Text: describing images using 1 million captioned photographs. In: NeurIPS (2011)
Ovchinnikova, E., Montazeri, N., Alexandrov, T., Hobbs, J.R., McCord, M.C., Mulkar-Mehta, R.: Abductive reasoning with a large knowledge base for discourse processing. In: IWCS (2011)
Park, D.H., Darrell, T., Rohrbach, A.: Robust change captioning. In: ICCV (2019)
Park, J.S., Bhagavatula, C., Mottaghi, R., Farhadi, A., Choi, Y.: VisualCOMET: reasoning about the dynamic context of a still image. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 508–524. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_30
Paul, D., Frank, A.: Generating hypothetical events for abductive inference. In: *SEM (2021)
Peirce, C.S.: Philosophical Writings of Peirce, vol. 217. Courier Corporation (1955)
Peirce, C.S.: Pragmatism and Pragmaticism, vol. 5. Belknap Press of Harvard University Press (1965)
Pezzelle, S., Greco, C., Gandolfi, G., Gualdoni, E., Bernardi, R.: Be different to be better! a benchmark to leverage the complementarity of language and vision. In: Findings of EMNLP (2020)
Pirsiavash, H., Vondrick, C., Torralba, A.: Inferring the why in images. Technical report (2014)
Qin, L., et al.: Back to the future: unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning. In: EMNLP (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: EMNLP (2019)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
Sap, M., Card, D., Gabriel, S., Choi, Y., Smith, N.A.: The risk of racial bias in hate speech detection. In: ACL (2019)
Shank, G.: The extraordinary ordinary powers of abductive reasoning. Theor. Psychol. 8(6), 841–860 (1998)
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
Shazeer, N., Stern, M.: Adafactor: adaptive learning rates with sublinear memory cost. In: ICML (2018)
Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: NeurIPS (2016)
Tafjord, O., Mishra, B.D., Clark, P.: ProofWriter: generating implications, proofs, and abductive statements over natural language. In: Findings of ACL (2021)
Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: EMNLP (2019)
Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., Fidler, S.: MovieQA: understanding stories in movies through question-answering. In: CVPR (2016)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Vedantam, R., Lin, X., Batra, T., Zitnick, C.L., Parikh, D.: Learning common sense through visual abstraction. In: ICCV (2015)
Wang, P., Wu, Q., Shen, C., Dick, A., Van Den Hengel, A.: FVQA: fact-based visual question answering. TPAMI 40(10), 2413–2427 (2017)
Wang, P., Wu, Q., Shen, C., Hengel, A.V.D., Dick, A.: Explicit knowledge-based reasoning for visual question answering. In: IJCAI (2017)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: EMNLP: System Demonstrations (2020)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T.S., Sun, M.: CPT: colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797 (2021)
Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. In: ICLR (2020)
Yu, L., Park, E., Berg, A.C., Berg, T.L.: Visual Madlibs: fill in the blank image generation and question answering. In: ICCV (2015)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_5
Zadeh, A., Chan, M., Liang, P.P., Tong, E., Morency, L.P.: Social-IQ: a question answering benchmark for artificial social intelligence. In: CVPR (2019)
Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: visual commonsense reasoning. In: CVPR (2019)
Zellers, R., et al.: MERLOT: multimodal neural script knowledge models. In: NeurIPS (2021)
Zhang, C., Gao, F., Jia, B., Zhu, Y., Zhu, S.C.: RAVEN: a dataset for relational and analogical visual reasoning. In: CVPR (2019)
Zhang, H., Huo, Y., Zhao, X., Song, Y., Roth, D.: Learning contextual causality from time-consecutive images. In: CVPR Workshops (2021)
Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7W: grounded question answering in images. In: CVPR (2016)
Acknowledgments
This work was funded by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the DARPA SemaFor program, and the Allen Institute for AI. AR was additionally supported in part by the DARPA PTG program, as well as BAIR's industrial alliance program. We also thank the UC Berkeley SemaFor group for helpful discussions and feedback.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hessel, J. et al. (2022). The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_32
DOI: https://doi.org/10.1007/978-3-031-20059-5_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20058-8
Online ISBN: 978-3-031-20059-5
eBook Packages: Computer Science; Computer Science (R0)