Semantic Curiosity for Active Visual Learning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12351)

Abstract

In this paper, we study the task of embodied interactive learning for object detection. Given a set of environments (and some labeling budget), our goal is to learn an object detector by having an agent select what data to obtain labels for. How should an exploration policy decide which trajectory should be labeled? One possibility is to use a trained object detector's failure cases as an external reward. However, this would require labeling the millions of frames needed to train RL policies, which is infeasible. Instead, we explore a self-supervised approach for training our exploration policy by introducing a notion of semantic curiosity. Our semantic curiosity policy is based on a simple observation: the detection outputs should be consistent. Therefore, semantic curiosity rewards trajectories with inconsistent labeling behavior and encourages the exploration policy to explore such areas. The exploration policy trained via semantic curiosity generalizes to novel scenes and helps train an object detector that outperforms baselines trained with other possible alternatives such as random exploration, prediction-error curiosity, and coverage-maximizing exploration.
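To make the inconsistency reward concrete, below is a minimal sketch (not the authors' implementation) of a semantic-curiosity-style reward. It assumes the detector's predictions along a trajectory have already been aggregated into per-cell class counts on a top-down map; the reward is the summed entropy of the per-cell label distributions, so regions where the detector flip-flops between labels across viewpoints contribute most. The function name and the (H, W, C) count-map input are illustrative assumptions, not the paper's exact interface.

    import numpy as np

    def semantic_curiosity_reward(cell_label_counts: np.ndarray) -> float:
        """Sum of per-cell label entropies over a top-down map (illustrative).

        cell_label_counts: (H, W, C) array counting how often each of C object
        classes was predicted for each map cell along the trajectory (an assumed
        aggregation of detections projected onto the map). Cells whose detections
        disagree across viewpoints have high label entropy, so trajectories that
        cover them receive a larger intrinsic reward.
        """
        counts = cell_label_counts.astype(np.float64)
        totals = counts.sum(axis=-1, keepdims=True)  # detections per cell
        # Normalize to per-cell label distributions; empty cells stay all-zero.
        probs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
        log_p = np.zeros_like(probs)
        np.log(probs, out=log_p, where=probs > 0)    # convention: 0 * log 0 = 0
        per_cell_entropy = -(probs * log_p).sum(axis=-1)
        return float(per_cell_entropy.sum())

    # Example: a 2x2 map with 3 classes; a consistently labeled cell adds no
    # reward, while a cell with conflicting labels across frames adds a lot.
    counts = np.zeros((2, 2, 3))
    counts[0, 0] = [5, 0, 0]   # always the same class -> entropy 0
    counts[0, 1] = [2, 2, 1]   # inconsistent labels   -> high entropy
    print(semantic_curiosity_reward(counts))

In this sketch the reward is computed once per trajectory; using it as the return for a self-supervised exploration policy (e.g., trained with PPO) is the intended usage, with the labeled frames for the detector collected only afterwards by the trained policy.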

Keywords

Embodied learning, Active visual learning, Semantic curiosity, Exploration

Notes

Acknowledgements

This work was supported by IARPA DIVA D17PC00340, ONR MURI, ONR Grant N000141812861, ONR Young Investigator, DARPA MCS, and NSF Graduate Research Fellowship. We would also like to thank NVIDIA for GPU support.

Licenses for referenced datasets:

Gibson: http://svl.stanford.edu/gibson2/assets/GDS_agreement.pdf

Matterport3D: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf

Replica: https://raw.githubusercontent.com/facebookresearch/Replica-Dataset/master/LICENSE

Supplementary material

Supplementary material 1 (mp4 5714 KB)

Supplementary material 2 (mp4 9164 KB)

Supplementary material 3 (mp4 5119 KB)

Supplementary material 4 (mp4 9828 KB)

Supplementary material 5 (mp4 1906 KB)

Supplementary material 6 (mp4 8418 KB)

References

  1. Ammirato, P., Poirson, P., Park, E., Košecká, J., Berg, A.C.: A dataset for developing and benchmarking active vision. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1378–1385. IEEE (2017)
  2. Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683 (2018)
  3. Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3(Nov), 397–422 (2002)
  4. Badrinarayanan, V., Galasso, F., Cipolla, R.: Label propagation in video sequences. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3265–3272. IEEE (2010)
  5. Bajcsy, R.: Active perception. Proc. IEEE 76(8), 966–1005 (1988)
  6. Bengio, Y., Delalleau, O., Le Roux, N.: Label propagation and quadratic criterion. In: Semi-Supervised Learning (2006)
  7. Chandra, S., Couprie, C., Kokkinos, I.: Deep spatio-temporal random fields for efficient video segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8915–8924 (2018)
  8. Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: International Conference on 3D Vision (3DV) (2017). http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf
  9. Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural SLAM. In: ICLR (2020). https://openreview.net/forum?id=HklXn1BKDH
  10. Chaplot, D.S., Lample, G.: Arnold: an autonomous agent to play FPS games. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
  11. Chaplot, D.S., Parisotto, E., Salakhutdinov, R.: Active neural localization. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=ry6-G_66b
  12. Chaplot, D.S., Salakhutdinov, R., Gupta, A., Gupta, S.: Neural topological SLAM for visual navigation. In: CVPR (2020)
  13. Chaplot, D.S., Sathyendra, K.M., Pasumarthi, R.K., Rajagopal, D., Salakhutdinov, R.: Gated-attention architectures for task-oriented language grounding. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
  14. Chen, T., Gupta, S., Gupta, A.: Learning exploration policies for navigation. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=SyMWn05F7
  15. Chen, X., Shrivastava, A., Gupta, A.: NEIL: extracting visual knowledge from web data. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1409–1416 (2013)
  16. Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: CVPR (2018)
  17. Dosovitskiy, A., Koltun, V.: Learning to act by predicting the future. In: ICLR (2017)
  18. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
  19. Eysenbach, B., Gupta, A., Ibarz, J., Levine, S.: Diversity is all you need: learning skills without a reward function. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=SJx63jRqFm
  20. Fang, K., Toshev, A., Fei-Fei, L., Savarese, S.: Scene memory transformer for embodied agents in long-horizon tasks. In: CVPR (2019)
  21. Fathi, A., Balcan, M.F., Ren, X., Rehg, J.M.: Combining self training and active learning for video segmentation. In: Proceedings of the British Machine Vision Conference, Georgia Institute of Technology (2011)
  22. Fox, D., Burgard, W., Thrun, S.: Active Markov localization for mobile robots. Robot. Auton. Syst. 25(3–4), 195–207 (1998)
  23. Gadde, R., Jampani, V., Gehler, P.V.: Semantic video CNNs through representation warping. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4453–4462 (2017)
  24. Gal, Y., Islam, R., Ghahramani, Z.: Deep Bayesian active learning with image data. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1183–1192. JMLR.org (2017)
  25. Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., Farhadi, A.: IQA: visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4089–4098 (2018)
  26. Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cognitive mapping and planning for visual navigation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625 (2017)
  27. Hermann, K.M., et al.: Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551 (2017)
  28. Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 11(Apr), 1563–1600 (2010)
  29. Jayaraman, D., Grauman, K.: Learning to look around: intelligently exploring unseen environments for unknown tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1238–1247 (2018)
  30. Kuo, W., Häne, C., Yuh, E., Mukherjee, P., Malik, J.: Cost-sensitive active learning for intracranial hemorrhage detection. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11072, pp. 715–723. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00931-1_82
  31. Lample, G., Chaplot, D.S.: Playing FPS games with deep reinforcement learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
  32. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
  33. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
  34. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings of 8th International Conference on Computer Vision, vol. 2, pp. 416–423, July 2001
  35. Mirowski, P., et al.: Learning to navigate in complex environments. In: ICLR (2017)
  36. Misra, I., Girshick, R., Fergus, R., Hebert, M., Gupta, A., van der Maaten, L.: Learning by asking questions. In: CVPR (2018)
  37. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration by self-supervised prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17 (2017)
  38. Pathak, D., Gandhi, D., Gupta, A.: Self-supervised exploration via disagreement. In: ICML (2019)
  39. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
  40. Savva, M., Chang, A.X., Dosovitskiy, A., Funkhouser, T., Koltun, V.: MINOS: multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931 (2017)
  41. Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV (2019)
  42. Schmidhuber, J.: A possibility for implementing curiosity and boredom in model-building neural controllers. In: Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222–227 (1991)
  43. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
  44. Sener, O., Savarese, S.: Active learning for convolutional neural networks: a core-set approach. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=H1aIuk-RW
  45. Settles, B.: Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences (2009)
  46. Siddiqui, Y., Valentin, J., Nießner, M.: ViewAL: active learning with viewpoint entropy for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9433–9443 (2020)
  47. Straub, J., et al.: The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019). https://github.com/facebookresearch/Replica-Dataset/blob/master/LICENSE
  48. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998). http://www.cs.ualberta.ca/~sutton/book/the-book.html
  49. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. TPAMI (2019)
  50. Vijayanarasimhan, S., Grauman, K.: Large-scale live active learning: training object detectors with crawled data and crowds. Int. J. Comput. Vis. 108(1–2), 97–114 (2014)
  51. Vondrick, C., Patterson, D., Ramanan, D.: Efficiently scaling up crowdsourced video annotation. Int. J. Comput. Vis. 1–21 (2013). http://dx.doi.org/10.1007/s11263-012-0564-1
  52. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). https://github.com/facebookresearch/detectron2
  53. Wu, Y., Tian, Y.: Training agent for first-person shooter game with actor-critic curriculum learning. In: ICLR (2017)
  54. Xia, F., Zamir, A.R., He, Z.Y., Sax, A., Malik, J., Savarese, S.: Gibson Env: real-world perception for embodied agents. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2018). http://svl.stanford.edu/gibson2/assets/GDS_agreement.pdf
  55. Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.: Visual curiosity: learning to ask questions to learn visual recognition. In: Conference on Robot Learning, pp. 63–80 (2018)
  56. Yang, J., et al.: Embodied amodal recognition: learning to move to perceive objects. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2040–2050 (2019)
  57. Yoo, D., Kweon, I.S.: Learning loss for active learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 93–102 (2019)
  58. Zhu, Y., et al.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364. IEEE (2017)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Carnegie Mellon University, Pittsburgh, USA
  2. UIUC, Champaign, USA