The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Humans have a remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can’t help but draw probable inferences beyond the literal scene, based on our everyday experience and knowledge about the world. For example, if we see a “20 mph” sign alongside a road, we might assume the street sits in a residential area (rather than on a highway), even if no houses are pictured. Can machines perform similar visual reasoning?

We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents. We adopt a free-viewing paradigm: participants first observe and identify salient clues within images (e.g., objects, actions) and then provide a plausible inference about the scene, given the clue. In total, we collect 363K (clue, inference) pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using our corpus, we test three complementary axes of abductive reasoning. We evaluate the capacity of models to: i) retrieve relevant inferences from a large candidate corpus; ii) localize evidence for inferences via bounding boxes; and iii) compare plausible inferences to match human judgments on a newly collected diagnostic corpus of 19K Likert-scale judgments. While we find that fine-tuning CLIP-RN50\(\times\)64 with a multitask objective outperforms strong baselines, significant headroom exists between model performance and human agreement. Data, models, and a leaderboard are available at http://visualabduction.com/.
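To make the retrieval axis (i) concrete, here is a minimal sketch, under stated assumptions, of ranking candidate inferences for an image and clue with an off-the-shelf CLIP dual encoder. This is an illustration, not the authors' released code; the prompt template and the function name `rank_inferences` are assumptions.

```python
# Illustrative sketch (not the authors' code): rank candidate inferences for an
# image + clue with a CLIP dual encoder, mirroring the retrieval evaluation axis.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x64", device=device)  # backbone family used in the paper

def rank_inferences(image_path, clue, candidates):
    """Return candidate indices ordered from most to least compatible with the image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    # One simple (assumed) way to condition on the clue: prepend it to each candidate inference.
    tokens = clip.tokenize([f"{clue} implies {c}" for c in candidates], truncate=True).to(device)
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(tokens)
    img = img / img.norm(dim=-1, keepdim=True)   # unit-normalize so dot products are cosines
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)               # cosine similarity of the image to each candidate
    return sims.argsort(descending=True).tolist()
```

Retrieval metrics such as recall@k or the mean rank of the annotated inference can then be computed from the returned ordering; the paper's fine-tuned multitask model is evaluated in the same score-and-rank fashion.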

You know my method.

It is founded upon the observation of trifles.

“The Boscombe Valley Mystery”, by A. C. Doyle

Notes

  1. While Holmes rarely makes mistakes, he frequently misidentifies his mostly abductive process of reasoning as “deductive” [8, 39].

  2. The correctness of abductive reasoning is certainly not guaranteed. Our goal is to study perception and reasoning without endorsing specific inferences (see Sect. 3.1).

  3. For instance, 94% of visual references in [75] are about depicted actors, and [44] even requires KB entries to explicitly regard people; see Fig. 2.

  4. We reserve generative evaluations (e.g., BLEU/CIDEr) for future work: shortcuts (e.g., outputting the technically correct “this is a photo” for all inputs) make generation evaluation difficult in the abductive setting (see Sect. 6). Nonetheless, generative models can be evaluated in our setup; we experiment with one in Sect. 5.1.

  5. https://www.perspectiveapi.com/; November 2021 version. The API (which itself is imperfect and has biases [18, 38, 55]) assigns a toxicity value from 0 to 1 for a given input text. Toxicity is defined as “a rude, disrespectful, or unreasonable comment that is likely to make one leave a discussion”.

  6. As discussed in Sect. 3, N has a mean/median of 1.17/1.0 across the corpus.

  7. In §B.1, for completeness, we give results on the retrieval and localization setups, but testing on clues instead.

  8. Our validation/test sets contain about 23K inferences. For efficiency, we randomly split them into 23 equal-sized chunks of about 1K inferences each and report retrieval averaged over the resulting splits.

  9. Since the annotators were able to specify multiple bounding boxes per observation pair, we count a match to any of the labeled bounding boxes (a minimal IoU-matching sketch follows these notes).

  10. A small number of images do not have a ResNeXt bounding box with IoU \(>0.5\) with any ground-truth bounding box; in Sect. 5.1, we show that most instances (96.2%) are solvable with this setup.

  11. Specifically, a CLIP RN50\(\times\)16 checkpoint that achieves strong validation retrieval performance (comparable to the checkpoint behind the test results reported in Sect. 5.1); model details are in Sect. 5.

  12. In Sect. 5.1, we show that models achieve significantly lower correlation than human agreement.
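The localization criterion described in notes 9 and 10 can be sketched in a few lines. The snippet below is an assumption-laden illustration rather than the authors' evaluation script: a predicted box counts as correct if its IoU with any annotated ground-truth box exceeds 0.5; the helper names `iou` and `matches_any` are hypothetical.

```python
# Illustrative sketch (not the authors' evaluation script) of the localization check:
# a prediction is correct if it overlaps ANY annotated ground-truth box with IoU > 0.5.
def iou(box_a, box_b):
    """Intersection-over-union for boxes given as (x1, y1, x2, y2) with x2 > x1, y2 > y1."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def matches_any(pred_box, gt_boxes, threshold=0.5):
    """True if the predicted box matches at least one annotated box above the IoU threshold."""
    return any(iou(pred_box, gt) > threshold for gt in gt_boxes)
```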

References

  1. Aliseda, A.: The logic of abduction: an introduction. In: Magnani, L., Bertolotti, T. (eds.) Springer Handbook of Model-Based Science. SH, pp. 219–230. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-30526-4_10
  2. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
  3. Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)
  4. Bender, E.M., Friedman, B.: Data statements for natural language processing: toward mitigating system bias and enabling better science. TACL 6, 587–604 (2018)
  5. Berg, A.C., et al.: Understanding and predicting importance in images. In: CVPR (2012)
  6. Bhagavatula, C., et al.: Abductive commonsense reasoning. In: ICLR (2020)
  7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. JMLR 3, 993–1022 (2003)
  8. Carson, D.: The abduction of Sherlock Holmes. Int. J. Police Sci. Manage. 11(2), 193–202 (2009)
  9. Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
  10. Dosovitskiy, A., et al.: An image is worth 16 \(\times\) 16 words: transformers for image recognition at scale. In: ICLR (2021)
  11. Du, L., Ding, X., Liu, T., Qin, B.: Learning event graph knowledge for abductive reasoning. In: ACL (2021)
  12. Fang, Z., Gokhale, T., Banerjee, P., Baral, C., Yang, Y.: Video2Commonsense: generating commonsense descriptions to enrich video captioning. In: EMNLP (2020)
  13. Garcia, N., Otani, M., Chu, C., Nakashima, Y.: KnowIT VQA: answering knowledge-based questions about videos. In: AAAI (2020)
  14. Gebru, T., et al.: Datasheets for datasets. Commun. ACM 64(12), 86–92 (2021)
  15. Grice, H.P.: Logic and conversation. In: Speech Acts, pp. 41–58. Brill (1975)
  16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  17. Hobbs, J.R., Stickel, M.E., Appelt, D.E., Martin, P.: Interpretation as abduction. Artif. Intell. 63(1–2), 69–142 (1993)
  18. Hosseini, H., Kannan, S., Zhang, B., Poovendran, R.: Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138 (2017)
  19. Ignat, O., Castro, S., Miao, H., Li, W., Mihalcea, R.: WhyAct: identifying action reasons in lifestyle vlogs. In: EMNLP (2021)
  20. Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: TGIF-QA: toward spatio-temporal reasoning in visual question answering. In: CVPR (2017)
  21. Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
  22. Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: CVPR (2016)
  23. Johnson, J., et al.: Image retrieval using scene graphs. In: CVPR (2015)
  24. Jonker, R., Volgenant, A.: A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38(4), 325–340 (1987). https://doi.org/10.1007/BF02278710
  25. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: EMNLP (2014)
  26. Kim, H., Zala, A., Bansal, M.: CoSIm: commonsense reasoning for counterfactual scene imagination. In: NAACL (2022)
  27. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  28. Krahmer, E., Van Deemter, K.: Computational generation of referring expressions: a survey. Comput. Linguist. 38(1), 173–218 (2012)
  29. Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV (2016). https://doi.org/10.1007/S11263-016-0981-7
  30. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2(1–2), 83–97 (1955)
  31. Lei, J., Yu, L., Berg, T.L., Bansal, M.: TVQA+: spatio-temporal grounding for video question answering. In: ACL (2020)
  32. Lei, J., Yu, L., Berg, T.L., Bansal, M.: What is more likely to happen next? Video-and-language future event prediction. In: EMNLP (2020)
  33. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)
  34. Liu, J., et al.: Violin: a large-scale dataset for video-and-language inference. In: CVPR (2020)
  35. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  36. Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: a visual question answering benchmark requiring external knowledge. In: CVPR (2019)
  37. Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: visual question answering by reading text in images. In: ICDAR (2019)
  38. Mitchell, M., et al.: Model cards for model reporting. In: FAccT (2019)
  39. Niiniluoto, I.: Defending abduction. Philos. Sci. 66, S436–S451 (1999)
  40. Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  41. Ordonez, V., Kulkarni, G., Berg, T.L.: Im2Text: describing images using 1 million captioned photographs. In: NeurIPS (2011)
  42. Ovchinnikova, E., Montazeri, N., Alexandrov, T., Hobbs, J.R., McCord, M.C., Mulkar-Mehta, R.: Abductive reasoning with a large knowledge base for discourse processing. In: IWCS (2011)
  43. Park, D.H., Darrell, T., Rohrbach, A.: Robust change captioning. In: ICCV (2019)
  44. Park, J.S., Bhagavatula, C., Mottaghi, R., Farhadi, A., Choi, Y.: VisualCOMET: reasoning about the dynamic context of a still image. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 508–524. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_30
  45. Paul, D., Frank, A.: Generating hypothetical events for abductive inference. In: *SEM (2021)
  46. Peirce, C.S.: Philosophical Writings of Peirce, vol. 217. Courier Corporation (1955)
  47. Peirce, C.S.: Pragmatism and Pragmaticism, vol. 5. Belknap Press of Harvard University Press (1965)
  48. Pezzelle, S., Greco, C., Gandolfi, G., Gualdoni, E., Bernardi, R.: Be different to be better! A benchmark to leverage the complementarity of language and vision. In: Findings of EMNLP (2020)
  49. Pirsiavash, H., Vondrick, C., Torralba, A.: Inferring the why in images. Technical report (2014)
  50. Qin, L., et al.: Back to the future: unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning. In: EMNLP (2020)
  51. Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)
  52. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020)
  53. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: EMNLP (2019)
  54. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
  55. Sap, M., Card, D., Gabriel, S., Choi, Y., Smith, N.A.: The risk of racial bias in hate speech detection. In: ACL (2019)
  56. Shank, G.: The extraordinary ordinary powers of abductive reasoning. Theor. Psychol. 8(6), 841–860 (1998)
  57. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
  58. Shazeer, N., Stern, M.: Adafactor: adaptive learning rates with sublinear memory cost. In: ICML (2018)
  59. Sohn, K.: Improved deep metric learning with multi-class N-pair loss objective. In: NeurIPS (2016)
  60. Tafjord, O., Mishra, B.D., Clark, P.: ProofWriter: generating implications, proofs, and abductive statements over natural language. In: Findings of ACL (2021)
  61. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: EMNLP (2019)
  62. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
  63. Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., Fidler, S.: MovieQA: understanding stories in movies through question-answering. In: CVPR (2016)
  64. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
  65. Vedantam, R., Lin, X., Batra, T., Zitnick, C.L., Parikh, D.: Learning common sense through visual abstraction. In: ICCV (2015)
  66. Wang, P., Wu, Q., Shen, C., Dick, A., Van Den Hengel, A.: FVQA: fact-based visual question answering. TPAMI 40(10), 2413–2427 (2017)
  67. Wang, P., Wu, Q., Shen, C., Hengel, A.V.D., Dick, A.: Explicit knowledge-based reasoning for visual question answering. In: IJCAI (2017)
  68. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: EMNLP: System Demonstrations (2020)
  69. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
  70. Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T.S., Sun, M.: CPT: colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797 (2021)
  71. Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. In: ICLR (2020)
  72. Yu, L., Park, E., Berg, A.C., Berg, T.L.: Visual Madlibs: fill in the blank image generation and question answering. In: ICCV (2015)
  73. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_5
  74. Zadeh, A., Chan, M., Liang, P.P., Tong, E., Morency, L.P.: Social-IQ: a question answering benchmark for artificial social intelligence. In: CVPR (2019)
  75. Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: visual commonsense reasoning. In: CVPR (2019)
  76. Zellers, R., et al.: MERLOT: multimodal neural script knowledge models. In: NeurIPS (2021)
  77. Zhang, C., Gao, F., Jia, B., Zhu, Y., Zhu, S.C.: RAVEN: a dataset for relational and analogical visual reasoning. In: CVPR (2019)
  78. Zhang, H., Huo, Y., Zhao, X., Song, Y., Roth, D.: Learning contextual causality from time-consecutive images. In: CVPR Workshops (2021)
  79. Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7W: grounded question answering in images. In: CVPR (2016)

Acknowledgments

This work was funded by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the DARPA SemaFor program, and the Allen Institute for AI. AR was additionally supported in part by the DARPA PTG program and by BAIR's industrial alliance program. We also thank the UC Berkeley SemaFor group for helpful discussions and feedback.

Author information

Correspondence to Jack Hessel or Jena D. Hwang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1777 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hessel, J. et al. (2022). The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_32

  • DOI: https://doi.org/10.1007/978-3-031-20059-5_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20058-8

  • Online ISBN: 978-3-031-20059-5

  • eBook Packages: Computer Science (R0)
