Abstract
We introduce a few-shot localization dataset originating from photographers who were authentically trying to learn about the visual content in the images they took. It includes nearly 10,000 segmentations of 100 categories in over 4,500 images that were taken by people with visual impairments. Compared to existing few-shot object detection and instance segmentation datasets, our dataset is the first to locate holes in objects (e.g., found in 12.3% of our segmentations), it shows objects that occupy a much larger range of sizes relative to the images, and text is over five times more common in our objects (e.g., found in 22.4% of our segmentations). Analysis of three modern few-shot localization algorithms demonstrates that they generalize poorly to our new dataset. The algorithms commonly struggle to locate objects with holes, very small and very large objects, and objects lacking text. To encourage a larger community to work on these unsolved challenges, we publicly share our annotated few-shot dataset at https://vizwiz.org.
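As a rough aid to interpreting the size statistics above, the following is a minimal sketch of how per-instance relative object size could be computed. It assumes the annotations are distributed in the standard COCO JSON format and that pycocotools is available; the file name is a placeholder, and this is not the authors' released tooling.

```python
# Minimal sketch: per-instance size relative to its image, assuming
# standard COCO-format annotations (field names from the COCO schema).
from pycocotools.coco import COCO

def relative_sizes(annotation_file):
    """Yield each instance's segmentation area as a fraction of its image area."""
    coco = COCO(annotation_file)
    for ann in coco.loadAnns(coco.getAnnIds()):
        img = coco.loadImgs(ann["image_id"])[0]
        yield ann["area"] / (img["width"] * img["height"])

if __name__ == "__main__":
    # "vizwiz_fewshot_train.json" is a hypothetical file name for illustration.
    sizes = list(relative_sizes("vizwiz_fewshot_train.json"))
    print(f"min={min(sizes):.4f} max={max(sizes):.4f}")
```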
Y.-Y. Tseng and A. Bell—Equal contribution
Notes
- 1.
- 2. Recorded evidence can be needed by companies for legal reasons.
- 3. The average hourly wage was $8.00 for classification and $9.61 for instance segmentation, respectively.
- 4. For efficiency, we evaluated the presence of text for a random sample of images in COCO-20i that is comparable to the number of images in our dataset: 8,000.
- 5. We use both the train and validation splits from each of the mainstream datasets for analysis. We randomly sample 10% of the annotations from COCO-20i due to its large size, and we use all annotations from PASCAL-5i and FSOD (a minimal sampling sketch follows these notes).
- 6. We exclude from consideration the other three metrics used to analyze the instance segmentations because boundary complexity is no longer relevant, text prevalence could be incorrect due to the bounding box extending beyond an object’s boundaries, and none of the other datasets located holes in objects.
- 7. We discuss the limitations of other FSIS algorithms for benchmarking on our dataset in the Supplementary Materials.
- 8. Of note, we also conducted cross-dataset experiments with YOLACT in the FSIS and FSOD settings; however, the cross-dataset performance was negligible. We attribute this to unsuccessful training with the chosen hyperparameters, both because the loss plateaued rather than converging with the new YOLACT hyperparameter values used in this paper and because the loss exploded when using the original YOLACT values (i.e., the performance of YOLACT reported in the original paper could not be replicated when using the different set of training categories from MS COCO). In summary, the cross-dataset analysis of YOLACT reinforces our initial finding that YOLACT performance is extremely sensitive to the chosen hyperparameters and training data, requiring custom tuning for each change.
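As referenced in note 5, below is a minimal, illustrative sketch of subsampling 10% of a large dataset's annotations for analysis. It assumes a COCO-style JSON annotation file and is not the exact script used in the paper.

```python
# Illustrative sketch of note 5's subsampling: keep a random 10% of the
# annotations from a (large) COCO-style annotation file for analysis.
import json
import random

def sample_annotations(annotation_file, fraction=0.10, seed=0):
    """Return a random subset of the annotation entries."""
    with open(annotation_file) as f:
        data = json.load(f)
    rng = random.Random(seed)
    k = int(len(data["annotations"]) * fraction)
    return rng.sample(data["annotations"], k)
```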
References
Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B.: One-shot learning for semantic segmentation. In: Proceedings of the British Machine Vision Conference (BMVC), pp. 167.1–167.13, September 2017
Bhattacharya, N., Li, Q., Gurari, D.: Why does a visual question have different answers? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4271–4280 (2019)
Bigham, J.P., et al.: VizWiz: nearly real-time answers to visual questions. In: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, pp. 333–342 (2010)
American Federation for the Blind: Low vision optical devices. https://www.afb.org/node/16207/low-vision-optical-devices
Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmentation. In: ICCV (2019)
Chen, C., Anjum, S., Gurari, D.: Grounding answers for visual questions asked by visually impaired people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19098–19107 (2022)
Chiu, T.Y., Zhao, Y., Gurari, D.: Assessing image quality issues for real-world problems. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3646–3656 (2020)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
Desmond, N.: Microsoft’s Seeing AI founder Saqib Shaikh is speaking at Sight Tech Global. https://social.techcrunch.com/2020/08/20/microsofts-seeingai-founder-saqib-shaikh-is-speaking-at-sight-tech-global/
Dong, X., Zheng, L., Ma, F., Yang, Y., Meng, D.: Few-example object detection with model communication. IEEE Trans. Pattern Anal. Mach. Intell. PP, 1 (2018)
Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. (IJCV) 88, 303–338 (2009)
Be My Eyes: Be My Eyes: Our story. https://www.bemyeyes.com/about
Fan, Q., Zhuo, W., Tang, C.K., Tai, Y.W.: Few-shot object detection with attention-RPN and multi-relation detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Fan, Z., et al.: FGN: fully guided network for few-shot instance segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9169–9178. Computer Vision Foundation/IEEE (2020)
Gurari, D., et al.: Predicting foreground object ambiguity and efficiently crowdsourcing the segmentation(s). Int. J. Comput. Vision 126(7), 714–730 (2018)
Gurari, D., et al.: VizWiz-Priv: a dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 939–948 (2019)
Gurari, D., et al.: VizWiz grand challenge: answering visual questions from blind people. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3617 (2018)
Gurari, D., Zhao, Y., Zhang, M., Bhattacharya, N.: Captioning images taken by people who are blind. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 417–434. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_25
Kim, J.-H., Lim, S., Park, J., Cho, H.: Korean localization of visual question answering for blind people. In: SK T-Brain - AI for Social Good Workshop at NeurIPS (2019)
Leng, J., et al.: A comparative review of recent few-shot object detection algorithms (2021)
Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection via feature reweighting. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8419–8428, November 2019
Lee, S., Reddie, M., Tsai, C.H., Beck, J., Rosson, M.B., Carroll, J.M.: The emerging professional practice of remote sighted assistance for people with visual impairments. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2020)
Li, X., Wei, T., Chen, Y.P., Tai, Y.W., Tang, C.K.: FSS-1000: a 1000-class dataset for few-shot segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Massiceti, D., et al.: ORBIT: a real-world few-shot dataset for teachable object recognition. In: ICCV 2021, October 2021
Michaelis, C., Ustyuzhaninov, I., Bethge, M., Ecker, A.S.: One-shot instance segmentation. ArXiv (2018)
Nguyen, K., Todorovic, S.: FAPIS: a few-shot anchor-free part-based instance segmenter. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11094–11103 (2021)
Nguyen, K.D.M., Todorovic, S.: Feature weighting and boosting for few-shot segmentation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 622–631 (2019)
Qiao, L., Zhao, Y., Li, Z., Qiu, X., Wu, J., Zhang, C.: DeFRCN: decoupled faster R-CNN for few-shot object detection. ArXiv (2021)
Stangl, A.J., Kothari, E., Jain, S.D., Yeh, T., Grauman, K., Gurari, D.: BrowseWithMe: an online clothes shopping assistant for people with visual impairments. In: Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 107–118 (2018)
Yan, X., Chen, Z., Xu, A., Wang, X., Liang, X., Lin, L.: Meta R-CNN: towards general solver for instance-level low-shot learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
Zeng, X., Wang, Y., Chiu, T.Y., Bhattacharya, N., Gurari, D.: Vision skills needed to answer visual questions. Proc. ACM Hum.-Comput. Interact. 4(CSCW2), 1–31 (2020)
Zhao, Z.Q., Zheng, P., Xu, S.T., Wu, X.: Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. PP, 1–21 (2019). https://doi.org/10.1109/TNNLS.2018.2876865
Acknowledgments
This project was supported in part by a National Science Foundation SaTC award (#2148080) and gift funding from Microsoft AI4A. We thank Leah Findlater and Yang Wang for contributing to this research idea and the anonymous reviewers for their valuable feedback to improve this work.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tseng, YY., Bell, A., Gurari, D. (2022). VizWiz-FewShot: Locating Objects in Images Taken by People with Visual Impairments. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_33
DOI: https://doi.org/10.1007/978-3-031-20074-8_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20073-1
Online ISBN: 978-3-031-20074-8
eBook Packages: Computer Science, Computer Science (R0)