VizWiz-FewShot: Locating Objects in Images Taken by People with Visual Impairments

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13668)

Abstract

We introduce a few-shot localization dataset originating from photographers who were authentically trying to learn about the visual content in the images they took. It includes nearly 10,000 segmentations of 100 categories in over 4,500 images that were taken by people with visual impairments. Compared to existing few-shot object detection and instance segmentation datasets, our dataset is the first to locate holes in objects (found in 12.3% of our segmentations), it shows objects that occupy a much larger range of sizes relative to the images, and text is over five times more common in our objects (found in 22.4% of our segmentations). Analysis of three modern few-shot localization algorithms demonstrates that they generalize poorly to our new dataset. The algorithms commonly struggle to locate objects with holes, very small and very large objects, and objects lacking text. To encourage a larger community to work on these unsolved challenges, we publicly share our annotated few-shot dataset at https://vizwiz.org.
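
To make two of these dataset statistics concrete (relative object size and the fraction of segmentations containing holes), below is a minimal Python sketch of how they could be computed. It is not the authors' released tooling: it assumes the annotations are distributed as a COCO-style instance-segmentation JSON file, that pycocotools and OpenCV are available, that the rasterized per-instance masks actually preserve holes (e.g., RLE-encoded masks), and the file name in the usage example is hypothetical.

import json

import cv2                      # used only for hole detection via contour hierarchy
import numpy as np
from pycocotools import mask as mask_utils


def instance_mask(ann, height, width):
    """Rasterize one COCO-style annotation (polygons or uncompressed RLE) to a binary mask."""
    rles = mask_utils.frPyObjects(ann["segmentation"], height, width)
    rle = mask_utils.merge(rles) if isinstance(rles, list) else rles
    # Note: holes are only detectable later if the stored encoding preserves them in the mask.
    return mask_utils.decode(rle)


def has_hole(mask):
    """True if any contour of the mask is nested inside another (an inner boundary)."""
    contours, hierarchy = cv2.findContours(          # OpenCV >= 4 return signature
        np.ascontiguousarray(mask, dtype=np.uint8),
        cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    if hierarchy is None:
        return False
    # In RETR_CCOMP mode, a contour whose parent index is not -1 is a hole.
    return any(contour_info[3] != -1 for contour_info in hierarchy[0])


def dataset_stats(annotation_file):
    """Return per-instance relative sizes and the fraction of segmentations with holes."""
    with open(annotation_file) as f:
        data = json.load(f)
    images = {img["id"]: img for img in data["images"]}
    relative_sizes, hole_flags = [], []
    for ann in data["annotations"]:
        img = images[ann["image_id"]]
        m = instance_mask(ann, img["height"], img["width"])
        relative_sizes.append(m.sum() / float(img["height"] * img["width"]))
        hole_flags.append(has_hole(m))
    return np.array(relative_sizes), float(np.mean(hole_flags))


# Hypothetical usage:
# sizes, hole_fraction = dataset_stats("vizwiz_fewshot_instances.json")
# print(f"median relative size: {np.median(sizes):.4f}, with holes: {hole_fraction:.1%}")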

Y.-Y. Tseng and A. Bell—Equal contribution

Notes

  1.

    Visual assistance services include Microsoft’s Seeing AI, Google’s Lookout, and TapTapSee. The popularity of such services is exemplified by companies’ reports about hundreds of thousands of users and tens of millions of requests [9, 12, 22].

  2.

    Companies may need recorded evidence for legal reasons.

  3.

    Average hourly wages were $8.00 and $9.61 for classification and instance segmentation, respectively.

  4.

    For efficiency, we evaluated the presence of text on a random sample of 8,000 images from COCO-20i, a number comparable to the size of our dataset.

  5.

    We use both the train and validation splits from each of the mainstream datasets for analysis. We randomly sample 10% of the annotations from COCO-20i due to its large size, and we use all annotations from PASCAL-5i and FSOD.

  6.

    We exclude from consideration the other three metrics used to analyze the instance segmentations because boundary complexity is no longer relevant, text prevalence could be incorrect due to the bounding box extending beyond an object’s boundaries, and none of the other datasets located holes in objects.

  7.

    We discuss the limitations of other FSIS algorithms for benchmarking on our dataset in the Supplementary Materials.

  8.

    Of note, we also conducted cross-dataset experiments with YOLACT in the FSIS and FSOD settings; however, the cross-dataset performance was negligible. We attribute this to unsuccessful training with the chosen hyperparameters: the loss plateaued rather than converging with the new YOLACT hyperparameter values used in this paper, and it exploded when using the original YOLACT values (i.e., the performance of YOLACT reported in the original paper could not be replicated with a different set of training categories from MS COCO). In summary, the cross-dataset results for YOLACT reinforce our initial finding that its performance is extremely sensitive to the chosen hyperparameters and training data, requiring custom tuning for each change.
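
The note above classifies YOLACT training runs by whether the loss converged, plateaued, or exploded. The following is a minimal, hypothetical sketch of such a loss-curve check; the window size and thresholds are arbitrary illustrative choices, not values from the paper.

import numpy as np


def diagnose_loss(losses, window=100, plateau_tol=0.01, explode_factor=10.0):
    """Classify a training-loss history as 'exploded', 'plateaued', or 'converging'."""
    losses = np.asarray(losses, dtype=float)
    if len(losses) < 2 * window:
        raise ValueError("need at least two windows of logged losses")
    # Divergence: non-finite values, or a final loss far above the early average.
    if not np.all(np.isfinite(losses)) or losses[-1] > explode_factor * losses[:window].mean():
        return "exploded"
    recent = losses[-window:].mean()
    earlier = losses[-2 * window:-window].mean()
    relative_improvement = (earlier - recent) / max(earlier, 1e-8)
    return "converging" if relative_improvement > plateau_tol else "plateaued"


# Hypothetical usage with per-iteration losses logged during training:
# print(diagnose_loss(logged_losses))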

References

  1. Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B.: One-shot learning for semantic segmentation. In: Proceedings of the British Machine Vision Conference (BMVC), pp. 167.1–167.13, September 2017

  2. Bhattacharya, N., Li, Q., Gurari, D.: Why does a visual question have different answers? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4271–4280 (2019)

  3. Bigham, J.P., et al.: VizWiz: nearly real-time answers to visual questions. In: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, pp. 333–342 (2010)

  4. American Federation for the Blind: Low vision optical devices. https://www.afb.org/node/16207/low-vision-optical-devices

  5. Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmentation. In: ICCV (2019)

  6. Chen, C., Anjum, S., Gurari, D.: Grounding answers for visual questions asked by visually impaired people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19098–19107 (2022)

  7. Chiu, T.Y., Zhao, Y., Gurari, D.: Assessing image quality issues for real-world problems. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3646–3656 (2020)

  8. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848

  9. Desmond, N.: Microsoft’s Seeing AI founder Saqib Shaikh is speaking at Sight Tech Global. https://social.techcrunch.com/2020/08/20/microsofts-seeingai-founder-saqib-shaikh-is-speaking-at-sight-tech-global/

  10. Dong, X., Zheng, L., Ma, F., Yang, Y., Meng, D.: Few-example object detection with model communication. IEEE Trans. Pattern Anal. Mach. Intell. PP, 1 (2018)

  11. Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. (IJCV) 88, 303–338 (2009)

  12. Be My Eyes: Be My Eyes: Our story. https://www.bemyeyes.com/about

  13. Fan, Q., Zhuo, W., Tang, C.K., Tai, Y.W.: Few-shot object detection with attention-RPN and multi-relation detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  14. Fan, Z., et al.: FGN: fully guided network for few-shot instance segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9169–9178. Computer Vision Foundation/IEEE (2020)

  15. Gurari, D., et al.: Predicting foreground object ambiguity and efficiently crowdsourcing the segmentation(s). Int. J. Comput. Vision 126(7), 714–730 (2018)

  16. Gurari, D., et al.: VizWiz-Priv: a dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 939–948 (2019)

  17. Gurari, D., et al.: VizWiz grand challenge: answering visual questions from blind people. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3617 (2018)

  18. Gurari, D., Zhao, Y., Zhang, M., Bhattacharya, N.: Captioning images taken by people who are blind. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 417–434. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_25

  19. Kim, J.-H., Lim, S., Park, J., Cho, H.: Korean localization of visual question answering for blind people. In: SK T-Brain - AI for Social Good Workshop at NeurIPS (2019)

  20. Leng, J., et al.: A comparative review of recent few-shot object detection algorithms (2021)

  21. Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection via feature reweighting. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8419–8428, November 2019

  22. Lee, S., Reddie, M., Tsai, C.H., Beck, J., Rosson, M.B., Carroll, J.M.: The emerging professional practice of remote sighted assistance for people with visual impairments. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2020)

  23. Li, X., Wei, T., Chen, Y.P., Tai, Y.W., Tang, C.K.: FSS-1000: a 1000-class dataset for few-shot segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  24. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  25. Massiceti, D., et al.: ORBIT: a real-world few-shot dataset for teachable object recognition. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), October 2021

  26. Michaelis, C., Ustyuzhaninov, I., Bethge, M., Ecker, A.S.: One-shot instance segmentation. ArXiv (2018)

  27. Nguyen, K., Todorovic, S.: FAPIS: a few-shot anchor-free part-based instance segmenter. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11094–11103 (2021)

  28. Nguyen, K.D.M., Todorovic, S.: Feature weighting and boosting for few-shot segmentation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 622–631 (2019)

  29. Qiao, L., Zhao, Y., Li, Z., Qiu, X., Wu, J., Zhang, C.: DeFRCN: decoupled faster R-CNN for few-shot object detection. ArXiv (2021)

  30. Stangl, A.J., Kothari, E., Jain, S.D., Yeh, T., Grauman, K., Gurari, D.: BrowseWithMe: an online clothes shopping assistant for people with visual impairments. In: Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 107–118 (2018)

  31. Yan, X., Chen, Z., Xu, A., Wang, X., Liang, X., Lin, L.: Meta R-CNN: towards general solver for instance-level low-shot learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), October 2019

  32. Zeng, X., Wang, Y., Chiu, T.Y., Bhattacharya, N., Gurari, D.: Vision skills needed to answer visual questions. Proc. ACM Hum.-Comput. Interact. 4(CSCW2), 1–31 (2020)

  33. Zhao, Z.Q., Zheng, P., Xu, S.T., Wu, X.: Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. PP, 1–21 (2019). https://doi.org/10.1109/TNNLS.2018.2876865

Acknowledgments

This project was supported in part by a National Science Foundation SaTC award (#2148080) and gift funding from Microsoft AI4A. We thank Leah Findlater and Yang Wang for contributing to this research idea and the anonymous reviewers for their valuable feedback to improve this work.

Author information

Corresponding author

Correspondence to Yu-Yun Tseng.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 13362 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tseng, YY., Bell, A., Gurari, D. (2022). VizWiz-FewShot: Locating Objects in Images Taken by People with Visual Impairments. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_33

  • DOI: https://doi.org/10.1007/978-3-031-20074-8_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20073-1

  • Online ISBN: 978-3-031-20074-8
