Visual Intelligence through Human Interaction

Krishna, Ranjay; Gordon, Mitchell; Fei-Fei, Li; Bernstein, Michael

doi:10.1007/978-3-030-82681-9_9

Part of the book series: Human–Computer Interaction Series (HCIS)

Abstract

Over the last decade, Computer Vision, the branch of Artificial Intelligence aimed at understanding the visual world, has evolved from simply recognizing objects in images to describing pictures, answering questions about images, helping robots maneuver around physical spaces, and even generating novel visual content. As these tasks and applications have advanced, so too has their reliance on ever more data, either for model training or for evaluation. In this chapter, we demonstrate that novel interaction strategies can enable new forms of data collection and evaluation for Computer Vision. First, we present a crowdsourcing interface that speeds up paid data collection by an order of magnitude, feeding the data-hungry nature of modern vision models. Second, we explore a method to increase volunteer contributions using automated social interventions. Third, we develop a system to ensure that human evaluation of generative vision models is reliable, affordable, and grounded in psychophysics theory. We conclude with future opportunities for Human–Computer Interaction to aid Computer Vision.

Notes

1. Applications can be found at https://taptapsee.com/, https://www.bemyeyes.com/, and https://camfindapp.com/.
2. The dataset of social media posts and social strategies for training the reinforcement learning model, as well as the trained contextual bandit model, is publicly available at http://cs.stanford.edu/people/ranjaykrishna/socialstrategies.
3. We explicitly reveal this ratio to evaluators. Otherwise, Amazon Mechanical Turk forums would enable evaluators to discuss and learn about this distribution over time, altering how different evaluators approach the task. By making this ratio explicit, we ensure that all evaluators enter the task with the same prior.
4. Hyper-realism is relative to the real dataset on which a model is trained. Some datasets already look less realistic because of lower resolution and/or lower diversity of images.

References

Adadi A, Berrada M (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (xai). IEEE Access 6:52138–52160
Article Google Scholar
Ambati V, Vogel S, Carbonell J (2011) Towards task recommendation in micro-task markets
Google Scholar
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) Vqa: visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425–2433
Google Scholar
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473
Banerjee S, Lavie A (2005) Meteor: an automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72
Google Scholar
Barratt S, Sharma R (2018) A note on the inception score. arXiv:1801.01973
Bernstein MS, Brandt J, Miller RC, Karger DR (2011) Crowds in two seconds: enabling realtime crowd-powered interfaces. In: Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, pp 33–42
Google Scholar
Bernstein MS, Little G, Miller RC, Hartmann B, Ackerman MS, Karger DR, Crowell D, Panovich K (2010) Soylent: a word processor with a crowd inside. In: Proceedings of the 23nd annual ACM symposium on user interface software and technology. ACM, pp 313–322
Google Scholar
Berthelot D, Schumm T, Metz L (2017) Began: boundary equilibrium generative adversarial networks. arXiv:1703.10717
Bigham JP, Jayant C, Ji H, Little G, Miller A, Miller RC, Miller R, Tatarowicz A, White B, White S, et al (2010) Vizwiz: nearly real-time answers to visual questions. In: Proceedings of the 23nd annual ACM symposium on User interface software and technology. ACM, pp 333–342
Google Scholar
Bińkowski M, Sutherland DJ, Arbel M, Gretton A (2018) Demystifying mmd gans. arXiv:1801.01401
Bishop CM (2006) Pattern recognition and machine learning. Springer
Google Scholar
Biswas A, Parikh D (2013) Simultaneous active learning of classifiers & attributes via relative feedback. In: 2013 Ieee conference on computer vision and pattern recognition (CVPR). IEEE, pp 644–651
Google Scholar
Bohus D, Rudnicky AI (2009) The ravenclaw dialog management framework: architecture and systems. Comput Speech Lang 23(3):332–361
Article Google Scholar
Borji A (2018) Pros and cons of gan evaluation measures. In: Computer vision and image understanding
Google Scholar
Brady E, Morris MR, Bigham JP (2015) Gauging receptiveness to social microvolunteering. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems, CHI ’15. ACM, New York, NY, USA, pp 1055–1064
Google Scholar
Brady EL, Zhong Y, Morris MR, Bigham JP (2013) Investigating the appropriateness of social network question asking as a resource for blind users. In: Proceedings of the 2013 conference on computer supported cooperative work. ACM, pp 1225–1236
Google Scholar
Bragg J, Daniel M, Weld DS (2013) Crowdsourcing multi-label classification for taxonomy creation. In: First AAAI conference on human computation and crowdsourcing
Google Scholar
Branson S, Hjorleifsson KE, Perona P (2014) Active annotation translation. In: 2014 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 3702–3709
Google Scholar
Branson S, Wah C, Schroff F, Babenko B, Welinder P, Perona P, Belongie S (2010) Visual recognition with humans in the loop. In: Computer vision–ECCV 2010. Springer, pp 438–451
Google Scholar
Broadbent DE, Broadbent MHP (1987) From detection to identification: response to multiple targets in rapid serial visual presentation. Percept Psychophys 42(2):105–113
Article Google Scholar
Brock A, Donahue J, Simonyan K (2018) Large scale gan training for high fidelity natural image synthesis. arXiv:1809.11096
Buçinca Z, Lin P, Gajos KZ, Glassman EL (2020) Proxy tasks and subjective measures can be misleading in evaluating explainable ai systems. In: Proceedings of the 25th international conference on intelligent user interfaces, pp 454–464
Google Scholar
Buolamwini J, Gebru T (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In: Conference on fairness, accountability and transparency, pp 77–91
Google Scholar
Burke M, Kraut RE, Joyce E (2014) Membership claims and requests: some newcomer socialization strategies in online communities. Small Group Research
Google Scholar
Burke M, Kraut R (2013) Using facebook after losing a job: Differential benefits of strong and weak ties. In: Proceedings of the 2013 conference on computer supported cooperative work. ACM, pp 1419–1430
Google Scholar
Card SK, Newell A, Moran TP (1983) The psychology of human-computer interaction
Google Scholar
Carroll M, Shah R, Ho MK, Griffiths T, Seshia S, Abbeel P, Dragan A (2019) On the utility of learning about humans for human-ai coordination. In: Advances in neural information processing systems, pp 5174–5185
Google Scholar
Cassell J, Thórisson KR (1999) The power of a nod and a glance: envelope vs. emotional feedback in animated conversational agents. Appl Artif Intell 13:519–538
Article Google Scholar
Cerrato L, Ekeklint S (2002) Different ways of ending human-machine dialogues
Google Scholar
Chaiken S (1989) Heuristic and systematic information processing within and beyond the persuasion context. In: Unintended thought, pp 212–252
Google Scholar
Chellappa R, Sinha P, Jonathon Phillips P (2010) Face recognition by computers and humans. Computer 43(2):46–55
Article Google Scholar
Cheng J, Teevan J, Bernstein MS (2015) Measuring crowdsourcing effort with error-time curves. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems. ACM, pp 1365–1374
Google Scholar
Chidambaram V, Chiang Y-H, Mutlu B (2012) Designing persuasive robots: how robots might persuade people using vocal and nonverbal cues. In: Proceedings of the seventh annual ACM/IEEE international conference on human-robot interaction. ACM, pp 293–300
Google Scholar
Chilton LB, Little G, Edge D, Weld DS, Landay JA (2013) Cascade: crowdsourcing taxonomy creation. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 1999–2008
Google Scholar
Cialdini R (2016) Pre-suasion: a revolutionary way to influence and persuade. Simon and Schuster
Google Scholar
Colligan L, Potts HWW, Finn CT, Sinkin RA (2015) Cognitive workload changes for nurses transitioning from a legacy system with paper documentation to a commercial electronic health record. Int J Med Inform 84(7):469–476
Article Google Scholar
Cornsweet TN (1962) The staircrase-method in psychophysics
Google Scholar
Corti K, Gillespie A (2016) Co-constructing intersubjectivity with artificial conversational agents: people are more likely to initiate repairs of misunderstandings with agents represented as human. Comput Hum Behav 58:431–442
Article Google Scholar
Dakin SC, Omigie D (2009) Psychophysical evidence for a non-linear representation of facial identity. Vis Res 49(18):2285–2296
Article Google Scholar
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol 1, pp 886–893
Google Scholar
Darley JM, Latané B (1968) Bystander intervention in emergencies: diffusion of responsibility. J Personal Soc Psychol 8(4p1):377
Article Google Scholar
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. Ieee, pp 248–255
Google Scholar
Deng J, Russakovsky O, Krause J, Bernstein MS, Berg A, Fei-Fei L (2014) Scalable multi-label annotation. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 3099–3102
Google Scholar
Denton EL, Chintala S, Fergus R, et al (2015) Deep generative image models using a laplacian pyramid of adversarial networks. In: Advances in neural information processing systems, pp 1486–1494
Google Scholar
Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Difallah DE, Demartini G, Cudré-Mauroux P (2013) Pick-a-crowd: tell me what you like, and i’ll tell you what to do. In: Proceedings of the 22nd international conference on world wide web, WWW ’13. ACM, New York, NY, USA, pp 367–374
Google Scholar
Dragan AD, Lee KCT, Srinivasa SS (2013) Legibility and predictability of robot motion. In: 2013 8th ACM/IEEE international conference on human-robot interaction (HRI). IEEE, pp 301–308
Google Scholar
Fast E, Chen B, Mendelsohn J, Bassen J, Bernstein MS (2018) Iris: a conversational agent for complex tasks. In: Proceedings of the 2018 CHI conference on human factors in computing systems. ACM, p 473
Google Scholar
Fast E, Steffee D, Wang L, Brandt JR, Bernstein MS (2014) Emergent, crowd-scale programming practice in the ide. In: Proceedings of the 32nd annual ACM conference on Human factors in computing systems. ACM, pp 2491–2500
Google Scholar
Fei-Fei L, Iyer A, Koch C, Perona P (2007) What do we perceive in a glance of a real-world scene? J Vis 7(1):10
Article Google Scholar
Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39(4):783–791
Article Google Scholar
Ferrara E, Varol O, Davis C, Menczer F, Flammini A (2016) The rise of social bots. Commun ACM 59(7):96–104
Article Google Scholar
Fraisse P (1984) Perception and estimation of time. Ann Rev Psychol 35(1):1–37
Article Google Scholar
Geiger D, Schader M (2014) Personalized task recommendation in crowdsourcing information systems – current state of the art. Decis Support Syst 65:3–16. Crowdsourcing and Social Networks Analysis
Google Scholar
Gilbert E, Karahalios K (2009) Predicting tie strength with social media. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 211–220
Google Scholar
Gillund G, Shiffrin RM (1984) A retrieval model for both recognition and recall. Psychol Rev 91(1):1
Article Google Scholar
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 580–587
Google Scholar
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Google Scholar
Gray M, Suri S (2019) Ghost work: how to stop silicon valley from building a new global underclass. Eamon Dolan
Google Scholar
Greene MR, Oliva A (2009) The briefest of glances: the time course of natural scene understanding. Psychol Sci 20(4):464–472
Article Google Scholar
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of wasserstein gans. In: Advances in neural information processing systems, pp 5767–5777
Google Scholar
Haque A, Milstein A, Fei-Fei L (2020) Illuminating the dark spaces of healthcare with ambient intelligence. Nature 585(7824):193–202
Article Google Scholar
Hashimoto TB, Zhang H, Liang P (2019) Unifying human and statistical evaluation for natural language generation. arXiv:1904.02792
Hata K, Krishna R, Fei-Fei L, Bernstein MS (2017) A glimpse far into the future: understanding long-term crowd worker quality. In: Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing. ACM, pp 889–901
Google Scholar
Healy K, Schussman A (2003) The ecology of open-source software development. Technical report, Technical report, University of Arizona, USA
Google Scholar
Hempel J (2015) Facebook launches m, its bold answer to siri and cortana. In: Wired. Retrieved January 1:2017
Google Scholar
Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in neural information processing systems, pp 6626–6637
Google Scholar
Hill BM (2013) Almost wikipedia: eight early encyclopedia projects and the mechanisms of collective action. Massachusetts institute of technology, pp 1–38
Google Scholar
Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771–1800
Article MATH Google Scholar
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Article Google Scholar
Hoffman ML (1981) Is altruism part of human nature? J Personal Soc Psychol 40(1):121
Article Google Scholar
Horvitz E (1999) Principles of mixed-initiative user interfaces. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 159–166
Google Scholar
Huang F, Canny JF (2019) Sketchforme: composing sketched scenes from text descriptions for interactive applications. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology, pp 209–220
Google Scholar
Huang T-HK, Chang J, Bigham J (2018) Evorus: a crowd-powered conversational assistant built to automate itself over time. In: Proceedings of the 2018 CHI conference on human factors in computing systems. ACM, p 295
Google Scholar
Hutto CJ, Gilbert E (2014) Vader: a parsimonious rule-based model for sentiment analysis of social media text. In: Eighth international AAAI conference on weblogs and social media
Google Scholar
Iordan MC, Greene MR, Beck DM, Fei-Fei L (2015) Basic level category structure emerges gradually across human ventral visual cortex. In: Journal of cognitive neuroscience
Google Scholar
Ipeirotis PG (2010) Analyzing the amazon mechanical turk marketplace. XRDS: Crossroads. The ACM Mag Stud 17(2):16–21
Google Scholar
Ipeirotis PG, Provost F, Wang J (2010) Quality management on amazon mechanical turk. In: Proceedings of the ACM SIGKDD workshop on human computation. ACM, pp 64–67
Google Scholar
Irani LC, Silberman M (2013) Turkopticon: interrupting worker invisibility in amazon mechanical turk. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 611–620
Google Scholar
Jain SD, Grauman K (2013) Predicting sufficient annotation strength for interactive foreground segmentation. In: 2013 IEEE international conference on computer vision (ICCV). IEEE, pp 1313–1320
Google Scholar
Jain U, Weihs L, Kolve E, Farhadi A, Lazebnik S, Kembhavi A, Schwing A (2020) A cordial sync: Going beyond marginal policies for multi-agent embodied tasks. In: European conference on computer vision. Springer, pp 471–490
Google Scholar
Jean N, Burke M, Xie M, Davis WM, Lobell DB, Ermon S (2016) Combining satellite imagery and machine learning to predict poverty. Science 353(6301):790–794
Article Google Scholar
Josephy T, Lease M, Paritosh P (2013) Crowdscale 2013: crowdsourcing at scale workshop report
Google Scholar
Kamar E, Hacker S, Horvitz E (2012) Combining human and machine intelligence in large-scale crowdsourcing. In: Proceedings of the 11th international conference on autonomous agents and multiagent systems-volume 1. International Foundation for Autonomous Agents and Multiagent Systems, pp 467–474
Google Scholar
Karger DR, Oh S, Shah D (2011) Budget-optimal crowdsourcing using low-rank matrix approximations. In: 2011 49th annual allerton conference on communication, control, and computing (allerton). IEEE, pp 284–291
Google Scholar
Karger DR, Oh S (2014) Shah D Budget-optimal task allocation for reliable crowdsourcing systems. Oper Res 62(1):1–24
Article MATH Google Scholar
Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv:1710.10196
Karras T, Laine S, Aila T (2018) A style-based generator architecture for generative adversarial networks. arXiv:1812.04948
Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4401–4410
Google Scholar
Khadpe P, Krishna R, Fei-Fei L, Hancock JT, Bernstein MS (2020) Conceptual metaphors impact perceptions of human-ai collaboration. Proc ACM Hum-Comput Interact 4(CSCW2):1–26
Article Google Scholar
Kittur A, Chi EH, Suh B (2008) Crowdsourcing user studies with mechanical turk. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 453–456
Google Scholar
Klein SA (2001) Measuring, estimating, and understanding the psychometric function: a commentary. Percept Psychophys 63(8):1421–1455
Article Google Scholar
Kramer ADI, Guillory JE, Hancock JT (2014) Experimental evidence of massive-scale emotional contagion through social networks. Proc Natl Acad Sci 111(24):8788–8790
Article Google Scholar
Kraut RE, Resnick P (2011) Encouraging contribution to online communities. Building successful online communities: evidence-based social design, pp 21–76
Google Scholar
Krishna R, Bernstein M, Fei-Fei L (2019) Information maximizing visual question generation. In: IEEE conference on computer vision and pattern recognition
Google Scholar
Krishna R, Hata K, Ren F, Fei-Fei L, Niebles JC (2017) Dense-captioning events in videos. In: Proceedings of the IEEE international conference on computer vision, pp 706–715
Google Scholar
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73
Article MathSciNet Google Scholar
Krishna RA, Hata K, Chen S, Kravitz J, Shamma DA, Fei-Fei L, Bernstein MS (2016) Embracing error to enable rapid crowdsourcing. In: Proceedings of the 2016 CHI conference on human factors in computing systems. ACM, pp 3167–3179
Google Scholar
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer
Google Scholar
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105
Google Scholar
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems 25. Curran Associates, Inc., pp 1097–1105
Google Scholar
Krueger GP (1989) Sustained work, fatigue, sleep loss and performance: a review of the issues. Work Stress 3(2):129–141
Article Google Scholar
Kumar R, Satyanarayan A, Torres C, Lim M, Ahmad S, Klemmer SR, Talton JO (2013) Webzeitgeist: design mining the web. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 3083–3092
Google Scholar
Kurakin A, Goodfellow I, Bengio S (2016) Adversarial examples in the physical world. arXiv:1607.02533
Kwon M, Biyik E, Talati A, Bhasin K, Losey DP, Sadigh D (2020) When humans aren’t optimal: robots that collaborate with risk-aware humans. In: Proceedings of the 2020 ACM/IEEE international conference on human-robot interaction, pp 43–52
Google Scholar
Laielli M, Smith J, Biamby G, Darrell T, Hartmann B (2019) Labelar: a spatial guidance interface for fast computer vision image collection. In: Proceedings of the 32nd annual ACM symposium on user interface software and technology, pp 987–998
Google Scholar
Langer EJ, Blank A, Chanowitz B (1978) The mindlessness of ostensibly thoughtful action: the role of “placebic’’ information in interpersonal interaction. J Personal Soc Psychol 36(6):635
Article Google Scholar
Laput G, Lasecki WS, Wiese J, Xiao R, Bigham JP, Harrison C (2015) Zensors: adaptive, rapidly deployable, human-intelligent sensor feeds. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems. ACM, pp 1935–1944
Google Scholar
Lasecki W, Miller C, Sadilek A, Abumoussa A, Borrello D, Kushalnagar R, Bigham J (2012) Real-time captioning by groups of non-experts. In: Proceedings of the 25th annual ACM symposium on user interface software and technology. ACM, pp 23–34
Google Scholar
Lasecki WS, Murray KI, White S, Miller RC, Bigham JP (2011) Real-time crowd control of existing interfaces. In: Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, pp 23–32
Google Scholar
Lasecki WS, Wesley R, Nichols J, Kulkarni A, Allen JF, Bigham JP (2013) Chorus: a crowd-powered conversational assistant. In: Proceedings of the 26th annual ACM symposium on User interface software and technology. ACM, pp 151–162
Google Scholar
Law E, Yin M, Goh J, Chen K, Terry MA, Gajos KZ (2016) Curiosity killed the cat, but makes crowdwork better. In: Proceedings of the 2016 CHI conference on human factors in computing systems. ACM, pp 4098–4110
Google Scholar
Le J, Edmonds A, Hester V, Biewald L (2010) Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution. In: SIGIR 2010 workshop on crowdsourcing for search evaluation, vol 2126, pp 22–32
Google Scholar
Levitt HCCH (1971) Transformed up-down methods in psychoacoustics. J Acoust Soc Am 49(2B):467–477
Article Google Scholar
Lewis DD, Hayes PJ (1994) Guest editorial. ACM Trans Inf Syst 12(3):231 July
Google Scholar
Li FF, VanRullen R, Koch C, Perona P (2002) Rapid natural scene categorization in the near absence of attention. Proc Natl Acad Sci 99(14):9596–9601
Article Google Scholar
Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th international conference on world wide web. ACM, pp 661–670
Google Scholar
Li T, Ogihara M (2003) Detecting emotion in music. In: ISMIR, vol 3, pp 239–240
Google Scholar
Liang L, Grauman K (2014) Beyond comparing image pairs: setwise active learning for relative attributes. In: 2014 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 208–215
Google Scholar
Lin C, Kamar E, Horvitz E (2014) Signals in the silence: models of implicit feedback in a recommendation system for crowdsourcing
Google Scholar
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Lawrence Zitnick C (2014) Microsoft coco: common objects in context. In: Computer vision–ECCV 2014. Springer, pp 740–755
Google Scholar
Lintott CJ, Schawinski K, Slosar A, Land K, Bamford S, Thomas D, Raddick MJ, Nichol RC, Szalay A, Andreescu D et al (2008) Galaxy zoo: morphologies derived from visual inspection of galaxies from the sloan digital sky survey. Mon Not R Astron Soc 389(3):1179–1189
Article Google Scholar
Liu A, Soderland S, Bragg J, Lin CH, Ling X, Weld DS (2016) Effective crowd annotation for relation extraction. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 897–906
Google Scholar
Liu Z, Luo P, Wang X, Tang X (2015) Deep learning face attributes in the wild. In: Proceedings of international conference on computer vision (ICCV)
Google Scholar
Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision. Ieee, vol 2, pp 1150–1157
Google Scholar
Lu C, Krishna R, Bernstein M, Fei-Fei L (2016) Visual relationship detection with language priors. In: European conference on computer vision. Springer, pp 852–869
Google Scholar
Lucic M, Kurach K, Michalski M, Gelly S, Bousquet O (2018) Are gans created equal? a large-scale study. In: Advances in neural information processing systems, pp 698–707
Google Scholar
Mani I (1999) Advances in automatic text summarization. MIT press
Google Scholar
Marcus A, Parameswaran A (2015) Crowdsourced data management: industry and academic perspectives. Foundations and Trends in Databases
Google Scholar
Markey PM (2000) Bystander intervention in computer-mediated communication. Comput Hum Behav 16(2):183–188
Article Google Scholar
Martin D, Hanrahan BV, O’Neill J, Gupta N (2014) Being a turker. In: Proceedings of the 17th ACM conference on computer supported cooperative work & social computing. ACM, pp 224–235
Google Scholar
Mason W, Suri S (2012) Conducting behavioral research on amazon’s mechanical turk. Behav Res Methods 44(1):1–23
Article Google Scholar
Mildenhall B, Srinivasan PP, Tancik M, Barron JT, Ramamoorthi R, Ng R (2020) Nerf: representing scenes as neural radiance fields for view synthesis. arXiv:2003.08934
Miller GA, Charles WG (1991) Contextual correlates of semantic similarity. Lang Cogn Process 6(1):1–28
Article Google Scholar
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T (2019) Model cards for model reporting. In: Proceedings of the conference on fairness, accountability, and transparency, pp 220–229
Google Scholar
Mitra T, Hutto CJ, Gilbert E (2015) Comparing person-and process-centric strategies for obtaining quality data on amazon mechanical turk. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems. ACM, pp 1345–1354
Google Scholar
Miyato T, Kataoka T, Koyama M, Yoshida Y (2018) Spectral normalization for generative adversarial networks. arXiv:1802.05957
Nass C, Brave S (2007) Wired for speech: how voice activates and advances the human-computer relationship. The MIT Press
Google Scholar
Niebles JC, Wang H, Fei-Fei L (2008) Unsupervised learning of human action categories using spatial-temporal words. Int J Comput Vis 79(3):299–318
Article Google Scholar
Olsson C, Bhupatiraju S, Brown T, Odena A, Goodfellow I (2018) Skill rating for generative models. arXiv:1808.04888
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
Article Google Scholar
Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, pp 311–318
Google Scholar
Park J, Krishna R, Khadpe P, Fei-Fei L, Bernstein M (2019) Ai-based request augmentation to increase crowdsourcing participation. Proc AAAI Conf Hum Comput Crowdsourcing 7:115–124
Google Scholar
Parkash A, Parikh D (2012) Attributes for classifier feedback. In: Computer vision–ECCV 2012. Springer, pp 354–368
Google Scholar
Peng Dai MD, Weld S (2010) Decision-theoretic control of crowd-sourced workflows. In: In the 24th AAAI conference on artificial intelligence (AAAI’10. Citeseer
Google Scholar
Portilla J, Simoncelli EP (2000) A parametric texture model based on joint statistics of complex wavelet coefficients. Int J Comput Vis 40(1):49–70
Article MATH Google Scholar
Potter MC (1976) Short-term conceptual memory for pictures. J Exp Psychol Hum Learn Mem 2(5):509
Article MathSciNet Google Scholar
Potter MC, Levy EI (1969) Recognition memory for a rapid sequence of pictures. J Exp Psychol 81(1):10
Article Google Scholar
Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434
Ravuri S, Mohamed S, Rosca M, Vinyals O (2018) Learning implicit generative models with the method of learned moments. arXiv:1806.11006
Rayner K, Smith TJ, Malcolm GL, Henderson JM (2009) Eye movements and visual encoding during scene perception. Psychol Sci 20(1):6–10
Article Google Scholar
Reeves A, Sperling G (1986) Attention gating in short-term visual memory. Psychol Rev 93(2):180
Article Google Scholar
Reeves B, Nass CI (1996) The media equation: how people treat computers, television, and new media like real people and places. Cambridge university press
Google Scholar
Reich J, Murnane R, Willett J (2012) The state of wiki usage in us k–12 schools: Leveraging web 2.0 data warehouses to assess quality and equity in online learning environments. Educ Res 41(1):7–15
Article Google Scholar
Robert C (1984) Influence: the psychology of persuasion. William Morrow and Company, Nowy Jork
Google Scholar
Rosca M, Lakshminarayanan B, Warde-Farley D, Mohamed S (2017) Variational approaches for auto-encoding generative adversarial networks. arXiv:1706.04987
Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner M (2019) Faceforensics++: learning to detect manipulated facial images. arXiv:1901.08971
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Li F-F (2014) Imagenet large scale visual recognition challenge. In: International Journal of Computer Vision, pp 1–42
Google Scholar
Russakovsky O, Li L-J, Fei-Fei L (2015) Best of both worlds: human-machine collaboration for object annotation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2121–2131
Google Scholar
Rzeszotarski JM, Chi E, Paritosh P, Dai P (2013) Inserting micro-breaks into crowdsourcing workflows. In: First AAAI conference on human computation and crowdsourcing
Google Scholar
Sajjadi MSM, Bachem O, Lucic M, Bousquet O, Gelly S (2018) Assessing generative models via precision and recall. In: Advances in neural information processing systems, pp 5228–5237
Google Scholar
Salehi N, Irani LC, Bernstein MS (2015) We are dynamo: overcoming stalling and friction in collective action for crowd workers. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems. ACM, pp 1621–1630
Google Scholar
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. In: Advances in neural information processing systems, pp 2234–2242
Google Scholar
Sardar A, Joosse M, Weiss A, Evers V (2012) Don’t stand so close to me: users’ attitudinal and behavioral responses to personal space invasion by robots. In: Proceedings of the seventh annual ACM/IEEE international conference on human-robot interaction. ACM, pp 229–230
Google Scholar
Schapire RE, Singer Y (2000) Boostexter: a boosting-based system for text categorization. Mach Learn 39(2):135–168
Article MATH Google Scholar
Seetharaman P, Pardo B (2014) Crowdsourcing a reverberation descriptor map. In: Proceedings of the ACM international conference on multimedia. ACM, pp 587–596
Google Scholar
Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 614–622
Google Scholar
Sheshadri A, Lease M (2013) Square: a benchmark for research on computing crowd consensus. In: First AAAI conference on human computation and crowdsourcing
Google Scholar
Shneiderman B, Maes P (1997) Direct manipulation vs. interface agents. Interactions 4(6):42–61 November
Article Google Scholar
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556
Google Scholar
Smyth P, Burl MC, Fayyad UM, Perona P (1994) Knowledge discovery in large image databases: dealing with uncertainties in ground truth. In: KDD workshop, pp 109–120
Google Scholar
Smyth P, Fayyad U, Burl M, Perona P, Baldi P (1995) Inferring ground truth from subjective labelling of venus images
Google Scholar
Snow R, O’Connor B, Jurafsky D, Ng AY (2008) Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 254–263
Google Scholar
Song Z, Chen Q, Huang Z, Hua Y, Yan S (2011) Contextualizing object detection and classification. In: 2011 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 1585–1592
Google Scholar
Sperling G (1963) A model for visual memory tasks. Hum Factors 5(1):19–31
Article Google Scholar
Su H, Deng J, Fei-Fei L (2012) Crowdsourcing annotations for visual object detection. In: Workshops at the twenty-sixth AAAI conference on artificial intelligence
Google Scholar
Suchman LA (1987) Plans and situated actions: the problem of human-machine communication. Cambridge University Press, Cambridge
Google Scholar
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826
Google Scholar
Tamuz O, Liu C, Belongie S, Shamir O, Kalai AT (2011) Adaptively learning the crowd kernel. arXiv:1105.1033
Taylor PJ, Thomas S (2008) Linguistic style matching and negotiation outcome. Negot Confl Manag Res 1(3):263–281
Google Scholar
Theis L, van den Oord A, Bethge M (2015) A note on the evaluation of generative models. arXiv:1511.01844
Thomaz AL, Breazeal C (2008) Teachable robots: understanding human teaching behavior to build more effective robot learners. Artif Intell 172(6–7):716–737
Article Google Scholar
Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li L-J (2016) Yfcc100m: the new data in multimedia research. Commun ACM 59(2). To Appear
Google Scholar
Vedantam R, Zitnick CL, Parikh D (2015) Cider: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575
Google Scholar
Vijayanarasimhan S, Jain P, Grauman K (2010) Far-sighted active learning on a budget for image and video recognition. In: 2010 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 3035–3042
Google Scholar
Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator. arXiv:1411.4555
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164
Google Scholar
von Ahn L, Dabbish L (2004) Labeling images with a computer game. In: Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, pp 319–326
Google Scholar
von Ahn L, Dabbish L (2004) Labeling images with a computer game, pp 319–326
Google Scholar
Vondrick C, Patterson D, Ramanan D (2013) Efficiently scaling up crowdsourced video annotation. Int J Comput Vis 101(1):184–204
Article Google Scholar
Wah C, Branson S, Perona P, Belongie S (2011) Multiclass recognition and part localization with humans in the loop. In: 2011 IEEE international conference on computer vision (ICCV). IEEE, pp 2524–2531
Google Scholar
Wah C, Van Horn G, Branson S, Maji S, Perona P, Belongie S (2014) Similarity comparisons for interactive fine-grained categorization. In: 2014 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 859–866
Google Scholar
Wang Y-C, Kraut RE, Levine JM (2015) Eliciting and receiving online support: using computer-aided content analysis to examine the dynamics of online social support. J Med Internet Res 17(4):e99
Google Scholar
Warde-Farley D, Bengio Y (2016) Improving generative adversarial networks with denoising feature matching
Google Scholar
Warncke-Wang M, Ranjan V, Terveen L, Hecht B (2015) Misalignment between supply and demand of quality content in peer production communities. In: Ninth international AAAI conference on web and social media
Google Scholar
Weichselgartner E, Sperling G (1987) Dynamics of automatic and controlled visual attention. Science 238(4828):778–780
Article Google Scholar
Weld DS, Lin CH, Bragg J (2015) Artificial intelligence and collective intelligence. In: Handbook of collective intelligence, pp. 89–114
Google Scholar
Welinder P, Branson S, Perona P, Belongie SJ (2010) The multidimensional wisdom of crowds. In: Advances in neural information processing systems, pp 2424–2432
Google Scholar
Whitehill J, Wu T-f, Bergsma J, Movellan JR, Ruvolo PL (2009) Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In: Advances in neural information processing systems, pp 2035–2043
Google Scholar
Wichmann FA, Jeremy Hill N (2001) The psychometric function: I. Fitting, sampling, and goodness of fit. Percept Psychophys 63(8):1293–1313
Article Google Scholar
Willis CG, Law E, Williams AC, Franzone BF, Bernardos R, Bruno L, Hopkins C, Schorn C, Weber E, Park DS et al (2017) Crowdcurio: an online crowdsourcing platform to facilitate climate change studies using herbarium specimens. New Phytol 215(1):479–488
Article Google Scholar
Wobbrock JO, Forlizzi J, Hudson SE, Myers BA (2002) Webthumb: interaction techniques for small-screen browsers. In: Proceedings of the 15th annual ACM symposium on User interface software and technology. ACM, pp 205–208
Google Scholar
Xia H, Jacobs J, Agrawala M (2020) Crosscast: adding visuals to audio travel podcasts. In: Proceedings of the 33rd annual ACM symposium on user interface software and technology, pp 735–746
Google Scholar
Yang D, Kraut RE (2017) Persuading teammates to give: systematic versus heuristic cues for soliciting loans. Proc. ACM Hum-Comput Interact 1(CSCW):114:1–114:21
Google Scholar
Yue Y-T, Yang Y-L, Ren G, Wang W (2017) Scenectrl: mixed reality enhancement via efficient scene editing. In: Proceedings of the 30th annual ACM symposium on user interface software and technology, pp 427–436
Google Scholar
Zhang H, Sciutto C, Agrawala M, Fatahalian K (2020) Vid2player: controllable video sprites that behave and appear like professional tennis players. arXiv:2008.04524
Zhang T (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the twenty-first international conference on Machine learning. ACM, p 116
Google Scholar
Zhou D, Basu S, Mao Y, Platt JC (2012) Learning from the wisdom of crowds by minimax entropy. In: Advances in neural information processing systems, pp 2195–2203
Google Scholar
Zhou S, Gordon M, Krishna R, Narcomey A, Fei-Fei LF, Bernstein M (2019) Hype: a benchmark for human eye perceptual evaluation of generative models. In: Advances in neural information processing systems, pp 3449–3461
Google Scholar

Download references

Acknowledgements

The first project was supported by the National Science Foundation award 1351131. The second project was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (“TRI”). The third project was partially funded by a Junglee Corporation Stanford Graduate Fellowship, an Alfred P. Sloan fellowship, and TRI. This chapter solely reflects the opinions and conclusions of its authors and not those of TRI or any other Toyota entity.

Author information

Authors and Affiliations

Stanford University, Stanford, CA, USA
Ranjay Krishna, Mitchell Gordon, Li Fei-Fei & Michael Bernstein

Corresponding author

Correspondence to Ranjay Krishna.

Editor information

Editors and Affiliations

Google Research (United States), Mountain View, CA, USA
Yang Li
Advanced Interactive Technologies Lab, ETH Zurich, Zurich, Switzerland
Otmar Hilliges

About this chapter

Cite this chapter

Krishna, R., Gordon, M., Fei-Fei, L., Bernstein, M. (2021). Visual Intelligence through Human Interaction. In: Li, Y., Hilliges, O. (eds) Artificial Intelligence for Human Computer Interaction: A Modern Approach. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-030-82681-9_9

DOI: https://doi.org/10.1007/978-3-030-82681-9_9
Published: 05 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82680-2
Online ISBN: 978-3-030-82681-9
eBook Packages: Computer Science, Computer Science (R0)