
Visual Superordinate Abstraction for Robust Concept Learning

  • Research Article
  • Published in: Machine Intelligence Research

Abstract

Concept learning constructs visual representations that are grounded in linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners remain vulnerable to attribute perturbations and out-of-distribution compositions at inference time. We attribute this bottleneck to a failure to exploit the intrinsic semantic hierarchy of visual concepts: for example, {red, blue, …} belong to the "color" subspace, whereas cube belongs to "shape". In this paper, we propose a visual superordinate abstraction framework that explicitly models semantic-aware visual subspaces (i.e., visual superordinates). Using only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view and then learns mutually exclusive visual superordinates under the guidance of this linguistic hierarchy. In addition, quasi-center visual concept clustering and superordinate shortcut learning schemes are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings, with relative gains in overall answering accuracy of 7.5% for reasoning with perturbations and 15.6% for compositional generalization tests.
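To make the abstract's central idea concrete, the sketch below shows one plausible reading of "semantic-aware visual subspaces": each superordinate (e.g., "color", "shape") gets its own learned projection, and an object feature is compared against per-concept quasi-centers inside that subspace, with a softmax making concepts within one superordinate mutually exclusive. This is a minimal illustration under stated assumptions, not the authors' implementation; all names (SuperordinateSpaces, concepts_per_space, the 0.1 temperature) are hypothetical.

# Minimal sketch of visual superordinate subspaces (hypothetical names;
# illustrates the abstract's description, not the paper's released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperordinateSpaces(nn.Module):
    """One learned projection per superordinate (e.g., "color", "shape")."""

    def __init__(self, feat_dim, sub_dim, concepts_per_space):
        # concepts_per_space: e.g., {"color": 8, "shape": 3}
        super().__init__()
        # One subspace projection per superordinate.
        self.projections = nn.ModuleDict({
            name: nn.Linear(feat_dim, sub_dim)
            for name in concepts_per_space
        })
        # Quasi-centers: one embedding per concept inside each subspace.
        self.centers = nn.ParameterDict({
            name: nn.Parameter(torch.randn(n, sub_dim))
            for name, n in concepts_per_space.items()
        })

    def forward(self, obj_feat):
        # obj_feat: (batch, feat_dim) visual features of detected objects.
        scores = {}
        for name, proj in self.projections.items():
            z = F.normalize(proj(obj_feat), dim=-1)      # (B, sub_dim)
            c = F.normalize(self.centers[name], dim=-1)  # (N, sub_dim)
            # Cosine similarity to each quasi-center; softmax within the
            # superordinate makes its concepts mutually exclusive, while
            # different superordinates stay independent of each other.
            scores[name] = F.softmax(z @ c.t() / 0.1, dim=-1)  # (B, N)
        return scores

# Usage: probabilities over "color" concepts are independent of "shape".
model = SuperordinateSpaces(feat_dim=256, sub_dim=64,
                            concepts_per_space={"color": 8, "shape": 3})
probs = model(torch.randn(4, 256))
print(probs["color"].shape, probs["shape"].shape)  # (4, 8) (4, 3)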



Acknowledgements

This work was supported in part by the Australian Research Council (ARC) (Nos. FL-170100117, DP-180103424, IC-190100031 and LE-200100049).

Author information


Corresponding author

Correspondence to Qi Zheng.

Additional information

Qi Zheng received the B.Eng. degree and M.Phil. degree in electronic information engineering from Huazhong University of Science and Technology, China in 2016 and 2019, respectively. She is currently a Ph.D. candidate in computer science at the University of Sydney, Australia.

Her research interests include multimodal learning and scene understanding.

Chao-Yue Wang received the B.Eng. degree in information engineering from Tianjin University, China in 2014, and the Ph.D. degree in information technology from the University of Technology Sydney, Australia in 2018. He was a postdoctoral researcher in machine learning and computer vision at the School of Computer Science, University of Sydney, Australia, and is currently a research scientist at JD Explore Academy (JD.com). His research outcomes have been published in prestigious journals and prominent conferences, such as IEEE T-PAMI, IEEE T-EVC, IEEE T-IP, NeurIPS, CVPR, ECCV and IJCAI. He received the Distinguished Student Paper Award at the 2017 International Joint Conference on Artificial Intelligence (IJCAI-17).

His research interests include developing deep learning techniques to solve real-world challenges, such as image synthesis/editing, controllable video generation, image/video enhancement, and medical image processing.

Dadong Wang received the B.Eng. degree in mechanical engineering and the M.Eng. and Ph.D. degrees in AI in machine fault diagnosis from the University of Science and Technology, China, in 1990, 1993 and 1997, respectively, and a Ph.D. degree in AI in process optimization from the University of Wollongong, Australia in 2002. He is a principal research scientist and the leader of the Commonwealth Scientific and Industrial Research Organisation (CSIRO) Quantitative Imaging Research Team, part of CSIRO Data61, as well as a conjoint professor at the University of New South Wales (UNSW) and an adjunct professor at the University of Technology Sydney (UTS). Prior to joining CSIRO in 2005, he worked for two multinational companies for six years, developing large intelligent systems for monitoring and control. He has published over 150 research papers, book chapters and reports. His research team has received Research Achievement Awards from CSIRO, the Engineering Excellence Award from Engineers Australia, and the R&D category of the NSW, Queensland and ACT iAwards. He has been developing automated image analysis solutions for scientific and industrial applications, with the aim of increasing both the quality and quantity of information extracted from multi-dimensional image data.

His research interests include image analysis, computer vision, artificial intelligence, signal processing and software engineering.

Da-Cheng Tao received the B.Eng. degree in electronic information engineering from the University of Science and Technology of China in 2002, the M.Phil. degree in information engineering from the Chinese University of Hong Kong, China in 2004, and the Ph.D. degree in computer science and information systems from the University of London, UK in 2007. He is a professor of computer science and an ARC Laureate Fellow with the School of Computer Science, Faculty of Engineering, University of Sydney, Australia. His research is detailed in one monograph and over 200 publications in prestigious journals and proceedings of prominent conferences such as IEEE TPAMI, TIP, TNNLS, IJCV, JMLR, NIPS, ICML, CVPR, ICCV, ECCV, AAAI, IJCAI, ICDM and ACM SIGKDD, with several best paper awards, including the Best Theory/Algorithm Paper Runner Up Award at IEEE ICDM'07, the Distinguished Paper Award at the 2018 IJCAI, the 2014 ICDM 10-year Highest-Impact Paper Award and the 2017 IEEE Signal Processing Society Best Paper Award. He received the 2015 Australian Scopus-Eureka Prize and the 2018 IEEE ICDM Research Contributions Award. He is a fellow of the Australian Academy of Science, AAAS, ACM and IEEE.

His research interests include applying statistics and mathematics to artificial intelligence and data science.


About this article


Cite this article

Zheng, Q., Wang, CY., Wang, D. et al. Visual Superordinate Abstraction for Robust Concept Learning. Mach. Intell. Res. 20, 79–91 (2023). https://doi.org/10.1007/s11633-022-1360-1

