REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets

Published in: International Journal of Computer Vision

Abstract

Machine learning models are known to perpetuate and even amplify the biases present in the data. However, these data biases frequently do not become apparent until after the models are deployed. Our work tackles this issue and enables the preemptive analysis of large-scale datasets. REvealing VIsual biaSEs (REVISE) is a tool that assists in the investigation of a visual dataset, surfacing potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based. Object-based biases relate to the size, context, or diversity of the depicted objects. Person-based metrics focus on analyzing the portrayal of people within the dataset. Geography-based analyses consider the representation of different geographic locations. These three dimensions are deeply intertwined in how they interact to bias a dataset, and REVISE sheds light on this; the responsibility then lies with the user to consider the cultural and historical context, and to determine which of the revealed biases may be problematic. The tool further assists the user by suggesting actionable steps that may be taken to mitigate the revealed biases. Overall, the key aim of our work is to tackle the machine learning bias problem early in the pipeline. REVISE is available at https://github.com/princetonvisualai/revise-tool.


Notes

  1. GeoJSON is a JSON-based standard for encoding boundary and region information as GPS coordinates. GeoJSON files for many geographic regions are easily downloadable online, and can be readily converted from shapefiles, another common geographic boundary format (a minimal conversion sketch follows these notes).

  2. Because top-1 accuracy of even the best model on all 365 scenes is only 55.19% (top-5 accuracy is 85.07%), we use the less granular scene categorization at the second tier of the defined scene hierarchy here. For example, aquarium, church indoor, and music studio fall into the scene group of indoor cultural.

  3. We use different subsets of the YFCC100m dataset depending on the particular annotations required by each metric.

  4. We consider the subset of the BDD100K dataset with images in New York City, which is a majority of the dataset.

  5. Random subset of size 100,000.

  6. We also looked into using reverse image searches to recover the query, but the “best guess labels” returned from these searches were not particularly useful, erring either on the side of being much too vague, such as returning “sea” for any scene with water, or on the side of being too specific, such as returning the exact name and brand of one of the objects.
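
The following sketch illustrates the shapefile-to-GeoJSON conversion and point-in-region lookup mentioned in Note 1. It is only a sketch: it assumes the geopandas and shapely libraries, a hypothetical regions.shp boundary file, and a hypothetical NAME attribute column.

```python
# Minimal sketch (assumed inputs): convert a shapefile of region boundaries to
# GeoJSON and look up which region contains a given GPS coordinate.
import geopandas as gpd
from shapely.geometry import Point

regions = gpd.read_file("regions.shp")                # hypothetical shapefile of boundaries
regions.to_file("regions.geojson", driver="GeoJSON")  # write the same boundaries as GeoJSON

point = Point(-74.0060, 40.7128)                      # (longitude, latitude), e.g., New York City
containing = regions[regions.contains(point)]         # rows whose geometry contains the point
print(containing["NAME"].tolist())                    # "NAME" is a hypothetical attribute column
```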

References

  • Alwassel, H., Heilbron, F. C., Escorcia, V., & Ghanem, B. (2018). Diagnosing error in temporal action detectors. In European conference on computer vision (ECCV).

  • Amazon. (2021). Amazon sagemaker clarify. Retrieved December 2, 2019, from https://aws.amazon.com/sagemaker/clarify/

  • Amazon Rekognition. (n.d.). Retrieved December 2, 2019, from https://aws.amazon.com/rekognition/

  • Balakrishnan, G., Xiong, Y., Xia, W., & Perona, P. (2020). Towards causal benchmarking of bias in face analysis algorithms. In European conference on computer vision (ECCV).

  • Bao, M., Zhou, A., Zottola, S., Brubach, B., Desmarais, S., Horowitz, A., ... Venkatasubramanian, S. (2021). It’s compaslicated: The messy relationship between rai datasets and algorithmic fairness benchmarks. arXiv:2106.05498.

  • Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and machine learning. Retrieved December 2, 2019, from http://www.fairmlbook.org

  • Bearman, S., Korobov, N., & Thorne, A. (2009). The fabric of internalized sexism. Journal of Integrated Social Sciences, 1(1), 10–47.

  • Bellamy, R. K. E., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., ... Zhang, Y. (2018). AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv:1810.01943.

  • Berg, A. C., Berg, T. L., Daumé III, H., Dodge, J., Goyal, A., Han, X., ... Yamaguchi, K. (2012). Understanding and predicting importance in images. In Conference on computer vision and pattern recognition (CVPR).

  • Birhane, A. (2021). Algorithmic injustice: A relational ethics approach. Patterns, 2, 100205.

  • Birhane, A., Prabhu, V. U., & Kahembwe, E. (2021). Multimodal datasets: Misogyny, pornography, and malignant stereotypes. arXiv:2110.01963.

  • Bolya, D., Foley, S., Hays, J., & Hoffman, J. (2020). TIDE: A general toolbox for identifying object detection errors. In European conference on computer vision (ECCV).

  • Brown, C. (2014). Archives and recordkeeping: Theory into practices. Facet Publishing.

  • Buda, M., Maki, A., & Mazurowski, M. A. (2017). A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381.

  • Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In ACM conference on fairness, accountability, transparency (FAccT).

  • Burns, K., Hendricks, L. A., Saenko, K., Darrell, T., & Rohrbach, A. (2018). Women also snowboard: Overcoming bias in captioning models. In European conference on computer vision (ECCV).

  • Cadene, R., Dancette, C., Ben-younes, H., Cord, M., & Parikh, D. (2019). RUBi: Reducing unimodal biases in visual question answering. In Advances in neural information processing systems (NeurIPS).

  • Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain humanlike biases. Science, 356(6334), 183–186.

  • Choi, M. J., Torralba, A., & Willsky, A. S. (2012). Context models and out-of-context objects. Pattern Recognition Letters, 33, 853–862.

  • Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 52, 153–163.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

  • Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., & Scheuerman, M. K. (2020). Bringing the people back in: Contesting benchmark machine learning datasets. arXiv:2007.07399.

  • Denton, E., Hutchinson, B., Mitchell, M., Gebru, T., & Zaldivar, A. (2019). Image counterfactual sensitivity analysis for detecting unintended bias. In CVPR workshop on fairness accountability transparency and ethics in computer vision.

  • DeVries, T., Misra, I., Wang, C., & van der Maaten, L. (2019). Does object recognition work for everyone? In Conference on computer vision and pattern recognition workshops (CVPRW).

  • Ding, F., Hardt, M., Miller, J., & Schmidt, L. (2021). Retiring adult: New datasets for fair machine learning. arXiv:2108.04884.

  • Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference.

  • Dwork, C., Immorlica, N., Kalai, A. T., & Leiserson, M. (2017). Decoupled classifiers for fair and efficient machine learning. arXiv:1707.06613.

  • Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88, 303–338.

  • Fabbrizzi, S., Papadopoulos, S., Ntoutsi, E., & Kompatsiaris, I. (2021). A survey on bias in visual datasets. arXiv:2107.07919.

  • Facebook AI. (2021). Fairness flow. Retrieved from https://ai.facebook.com/blog/how-were-using-fairness-flow-to-help-build-ai-that-works-better-for-everyone/

  • Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE CVPR workshop of generative model based vision.

  • Fitzpatrick, T. B. (1988). The validity and practicality of sun-reactive skin types I through VI. Archives of Dermatology, 6, 869–871.

  • Gajane, P., & Pechenizkiy, M. (2017). On formalizing fairness in prediction with machine learning. arXiv:1710.03184.

  • Galleguillos, C., Rabinovich, A., & Belongie, S. (2008). Object categorization using co-occurrence, location and appearance. In Conference on computer vision and pattern recognition (CVPR).

  • Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E. L., & Fei-Fei, L. (2017). Using deep learning and google street view to estimate the demographic makeup of neighborhoods across the united states. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 114(50), 13108–13113. https://doi.org/10.1073/pnas.1700035114

  • Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for datasets. In ACM conference on fairness, accountability, transparency (FAccT).

  • Google People + AI Research. (2021). Know your data. Retrieved from https://knowyourdata.withgoogle.com/.

  • Green, B., & Hu, L. (2018). The myth in the methodology: Towards a recontextualization of fairness in machine learning. In Machine learning: The debates workshop at the 35th international conference on machine learning.

  • Hamidi, F., Scheuerman, M. K., & Branham, S. (2018). Gender recognition or gender reductionism? The social implications of embedded gender recognition systems. In Conference on human factors in computing systems (CHI).

  • Hanna, A., Denton, E., Smart, A., & Smith-Loud, J. (2020). Towards a critical race methodology in algorithmic fairness. In ACM conference on fairness, accountability, transparency (FAccT).

  • Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In Advances in neural information processing systems (NeurIPS).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In European conference on computer vision (ECCV).

  • Hill, K. (2020). Wrongfully accused by an algorithm. The New York Times. Retrieved from https://www.nytimes.com/2020/06/24/technology/facial-recognition-arrest.html.

  • Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In European conference on computer vision (ECCV).

  • Holland, S., Hosny, A., Newman, S., Joseph, J., & Chmielinski, K. (2018). The dataset nutrition label: A framework to drive higher data quality standards. arXiv:1805.03677.

  • Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength natural language processing in python. https://doi.org/10.5281/zenodo.1212303

  • Hua, J., Xiong, Z., Lowey, J., Suh, E., & Dougherty, E. R. (2005). Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21, 1509–1515.

  • Idelbayev, Y. (2019). Retrieved from https://github.com/akamaster/pytorchresnetcifar10

  • Jacobs, A. Z., & Wallach, H. (2021). Measurement and fairness. In ACM conference on fairness, accountability, transparency (FAccT).

  • Jain, A. K., & Waller, W. (1978). On the optimal number of features in the classification of multivariate gaussian data. Pattern Recognition, 10, 365–374.

  • Jo, E. S., & Gebru, T. (2020). Lessons from archives: Strategies for collecting sociocultural data in machine learning. In ACM conference on fairness, accountability, transparency (FAccT).

  • Jonckheere, A. R. (1954). A distribution-free k-sample test against ordered alternatives. Biometrika, 41, 133–145.

  • Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016). Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

  • Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

  • Kay, M., Matuszek, C., & Munson, S. A. (2015). Unequal representation and gender stereotypes in image search results for occupations. Human Factors in Computing Systems, 33, 3819–3828.

  • Keeping Track Online. (2019). Median incomes. Retrieved from https://data.cccnewyork.org/data/table/66/median-incomes#66/107/62/a/a.

  • Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., & Torralba, A. (2012). Undoing the damage of dataset bias. In European conference on computer vision (ECCV).

  • Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., & Schölkopf, B. (2017). Avoiding discrimination through causal reasoning. In Advances in neural information processing systems (NeurIPS).

  • Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. In Proceedings of innovations in theoretical computer science (ITCS).

  • Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-ElHaija, S., Kuznetsova, A., ... Murphy, K. (2017). Openimages: A public dataset for large-scale multilabel and multi-class image classification. Dataset available from https://github.com/openimages.

  • Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., ... Fei-Fei, L. (2016). Visual genome: Connecting language and vision using crowdsourced dense image annotations. Retrieved from https://arxiv.org/abs/1602.07332

  • Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical Report.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS) (pp. 1097–1105).

  • Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., ... Dollar, P. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV).

  • Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, 39, 539–550.

  • Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2019). A survey on bias and fairness in machine learning. arXiv:1908.09635.

  • Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., ... Gebru, T. (2019). Model cards for model reporting. In ACM conference on fairness, accountability, transparency (FAccT).

  • Moulton, J. (1981). The myth of the neutral ‘man’. In Sexist language: A modern philosophical analysis (pp. 100–116).

  • Ojala, M., & Garriga, G. C. (2010). Permutation tests for studying classifier performance. Journal of Machine Learning Research, 11, 1833–1863.

  • Oksuz, K., Cam, B. C., Kalkan, S., & Akbas, E. (2019). Imbalance problems in object detection: A review. arXiv:1909.00169.

  • Oliva, A., & Torralba, A. (2007). The role of context in object recognition. Trends in Cognitive Sciences, 11, 520–527.

  • Ouyang, W., Wang, X., Zhang, C., & Yang, X. (2016). Factors in finetuning deep model for object detection with long-tail distribution. In Conference on computer vision and pattern recognition (CVPR).

  • Paullada, A., Raji, I. D., Bender, E. M., Denton, E., & Hanna, A. (2020). Data and its (dis)contents: A survey of dataset development and use in machine learning research. In NeurIPS workshop: ML retrospectives, surveys, and meta-analyses.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  • Peng, K., Mathur, A., & Narayanan, A. (2021). Mitigating dataset harms requires stewardship: Lessons from 1000 papers. In Advances in Neural Information Processing Systems (NeurIPS).

  • Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., & Weinberger, K. Q. (2017). On fairness and calibration. In Advances in neural information processing systems (NeurIPS).

  • Prabhu, V. U., & Birhane, A. (2020). Large image datasets: A pyrrhic win for computer vision? arXiv:2006.16923.

  • Roll, U., Correia, R. A., & Berger-Tal, O. (2018). Using machine learning to disentangle homonyms in large text corpora. Conservation Biology, 32, 716–724.

  • Rosenfeld, A., Zemel, R., & Tsotsos, J. K. (2018). The elephant in the room. arXiv:1808.03305.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y

  • Salakhutdinov, R., Torralba, A., & Tenenbaum, J. (2011). Learning to share visual appearance for multiclass object detection. In Conference on computer vision and pattern recognition (CVPR).

  • Sattigeri, P., Hoffman, S. C., Chenthamarakshan, V., & Varshney, K. R. (2019). Fairness GAN. IBM Journal of Research and Development, 63, 3-1–3-9.

  • Scheuerman, M. K., Hanna, A., & Denton, E. (2021). Do datasets have politics? disciplinary values in computer vision dataset development. In ACM conference on computer-supported cooperative work and social computing (CSCW).

  • Scheuerman, M. K., Wade, K., Lustig, C., & Brubaker, J. R. (2020). How we’ve taught algorithms to see identity: Constructing race and gender in image databases for facial analysis. In Proceedings of the ACM on human–computer interaction.

  • Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). No classification without representation: Assessing geodiversity issues in open datasets for the developing world. In NeurIPS workshop: Machine learning for the developing world.

  • Sharmanska, V., Hendricks, L. A., Darrell, T., & Quadrianto, N. (2020). Contrastive examples for addressing the tyranny of the majority. arXiv:2004.06524.

  • Sheeny, M., Pellegrin, E. D., Mukherjee, S., Ahrabian, A., Wang, S., & Wallace, A. (2021). RADIATE: A radar dataset for automotive perception in bad weather. In IEEE international conference on robotics and automation (ICRA).

  • Sigurdsson, G. A., Russakovsky, O., & Gupta, A. (2017). What actions are needed for understanding human actions in videos? In International conference on computer vision (ICCV).

  • Steed, R., & Caliskan, A. (2021). Image representations learned with unsupervised pre-training contain human-like biases. In Conference on fairness, accountability, and transparency (FAccT).

  • Swinger, N., De-Arteaga, M., Heffernan IV, N., Leiserson, M., & Kalai, A. (2019). What are the biases in my word embedding? In Proceedings of the AAAI/ACM conference on artificial intelligence, ethics, and society (AIES).

  • The United States Census Bureau. (2019). American community survey 1-year estimates, table s1903 (2005–2019). Retrieved from https://data.census.gov/.

  • Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., & Li, L.-J. (2016). YFCC100M: The new data in multimedia research. Communications of the ACM, 59, 64–73.

  • Tommasi, T., Patricia, N., Caputo, B., & Tuytelaars, T. (2015). A deeper look at dataset bias. In German conference on pattern recognition.

  • Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In Conference on computer vision and pattern recognition (CVPR).

  • Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: A large dataset for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.

  • United Nations Statistics Division. (2019). United Nations statistics division - methodology. Retrieved from https://unstats.un.org/unsd/methodology/m49/.

  • van Miltenburg, E., Elliott, D., & Vossen, P. (2018). Talking about other people: An endless range of possibilities. In International natural language generation conference.

  • Wang, A., Narayanan, A., & Russakovsky, O. (2020). REVISE: A tool for measuring and mitigating bias in visual datasets. In European conference on computer vision (ECCV).

  • Wang, A., & Russakovsky, O. (2021). Directional bias. In International conference on machine learning (ICML).

  • Wang, Z., Qinami, K., Karakozis, Y., Genova, K., Nair, P., Hata, K., & Russakovsky, O. (2020). Towards fairness in visual recognition: Effective strategies for bias mitigation. In Conference on computer vision and pattern recognition (CVPR).

  • Wilson, B., Hoffman, J., & Morgenstern, J. (2019). Predictive inequity in object detection. arXiv:1902.11097

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In Conference on computer vision and pattern recognition (CVPR).

  • Yang, J., Price, B., Cohen, S., & Yang, M.-H. (2014). Context driven scene parsing with attention to rare classes. In Conference on computer vision and pattern recognition (CVPR).

  • Yang, K., Qinami, K., Fei-Fei, L., Deng, J., & Russakovsky, O. (2020). Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. In ACM conference on fairness, accountability, transparency (FAccT).

  • Yang, K., Russakovsky, O., & Deng, J. (2019). Spatialsense: An adversarially crowdsourced benchmark for spatial relation recognition. In International conference on computer vision (ICCV).

  • Yang, K., Yau, J., Fei-Fei, L., Deng, J., & Russakovsky, O. (2021). A study of face obfuscation in imagenet. arXiv:2103.06191.

  • Yao, Y., Zhang, J., Shen, F., Hua, X., Xu, J., & Tang, Z. (2017). Exploiting web images for dataset construction: A domain robust approach. IEEE Transactions on Multimedia, 19, 1771–1784.

  • Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., ... Darrell, T. (2020). Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM conference on AI, ethics, and society.

  • Zhao, D., Wang, A., & Russakovsky, O. (2021). Understanding and evaluating racial biases in image captioning. arXiv:2106.08503.

  • Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2017). Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the conference on empirical methods in natural language processing (EMNLP).

  • Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 1452–1464.

  • Zhu, X., Anguelov, D., & Ramanan, D. (2014). Capturing long-tail distributions of object subcategories. In Conference on computer vision and pattern recognition (CVPR).

Acknowledgements

This work is partially supported by the National Science Foundation under Grant No. 1763642 and No. 1704444. We would also like to thank Felix Yu, Vikram Ramaswamy, and Zhiwei Deng for their helpful comments, and Zeyu Wang, Deniz Oktay, and Nobline Yoo for testing out the tool and providing feedback.

Author information

Corresponding author

Correspondence to Angelina Wang.

Additional information

Communicated by Diane Larlus.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

1.1 Gender Label Inference

An additional person-based metric we consider is gender label inference. Specifically, we note two especially concerning practices: assigning gender to a person who is too small to be identifiable, or to a person for whom no face is detected in the image. This is not to say that assigning gender is acceptable whenever these cases do not apply, as gender cannot be visually perceived by an annotator; rather, assigning gender when one of these two cases applies is a particularly egregious practice. For example, it has been shown that in images where a person is fully clad in snowboarding equipment and a helmet, they are still labeled as male due to preconceived stereotypes (Burns et al., 2018). We investigate the contextual cues annotators rely on to assign gender, and consider the gender of a person unlikely to be identifiable if the person is too small (below 1000 pixels, which is approximately the number of dimensions humans require to perform certain recognition tasks in color images; Torralba et al., 2008) or if automated face detection fails (we used Amazon Rekognition (“Amazon Rekognition”, n.d.), but any other face detection tool can be used). For COCO, we find that among images with a human whose gender is unlikely to be identifiable, 77% are labeled male. In OpenImages (see Note 5), this fraction is 69%. Thus, annotators seem to default to labeling a person as male when they cannot identify the gender; the use of male-as-norm is a problematic practice (Moulton, 1981). Further, we find that annotators are most likely to default to male as a gender label in outdoor sports fields, parks scenes, assigning male at 2.9x the rate of female. Similarly, the ratio for indoor transportation scenes is 4.2x and for outdoor transportation scenes 4.5x, with the closest ratio in shopping and dining scenes, where male is only 1.2x as likely as female. This suggests that in the absence of gender cues from the person themselves, annotators make inferences based on image context. In Fig. 17 we show examples from OpenImages where our tool determined that gender definitely should not have been inferred, but was. Because attributes like skin tone can be inferred from parts of the image, such as a person’s arm, we do not consider that attribute in this analysis.
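
For illustration, a minimal sketch of this check is given below. The annotation fields (gender, bbox_width, bbox_height, image_path) and the detect_faces helper are hypothetical stand-ins; the analysis above used Amazon Rekognition, but any face detector can be substituted.

```python
# Minimal sketch (assumed annotation format): flag person annotations whose
# gender labels are unlikely to be justifiable under either criterion.

MIN_PERSON_AREA = 1000  # pixels; size threshold motivated by Torralba et al. (2008)

def detect_faces(image_path):
    """Hypothetical stand-in for a face detector (e.g., Amazon Rekognition);
    returns a possibly empty list of detected face boxes."""
    raise NotImplementedError

def flag_unjustified_gender_labels(annotations):
    """Yield person annotations that carry a gender label even though the
    person is too small or no face is detected in the image."""
    for ann in annotations:
        if ann.get("gender") is None:
            continue  # no gender label was assigned, so nothing to flag
        too_small = ann["bbox_width"] * ann["bbox_height"] < MIN_PERSON_AREA
        if too_small or not detect_faces(ann["image_path"]):
            yield ann  # gender was inferred despite insufficient visual evidence
```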

This metric of gender label inference also raises a larger question of whether, and in which situations, gender labels should ever be assigned at all (Scheuerman et al., 2020; Hamidi et al., 2018). However, that question is outside the scope of this work; here we simply recommend that dataset creators give clearer guidance to annotators and remove gender labels from images where gender definitely cannot be determined. We note that while we picked out two criteria (the person being too small, and no face being detected) as instances in which gender inference is particularly egregious, there are many other situations that users may wish to delineate for their own purposes.

Fig. 17: Examples from OpenImages where annotators assigned gender to the person, but should not have. The criteria used are that the person is either too small or has no face detected.

1.2 Validating Distance as a Proxy for Interaction

In Sect. 5.1, Instance Counts and Distances, we claim that we can use the distance between a person and an object as a proxy for whether the person, p, is actually interacting with the object, o, as opposed to merely appearing in the same image with it. This allows us to gain more meaningful insight into how genders may interact with objects differently. The distance measure we define is \(\mathrm{dist} = \frac{\text{distance between p and o centers}}{\sqrt{\text{area}_{\mathrm{p}} \cdot \text{area}_{\mathrm{o}}}}\), which is a relative measure within each object class because it makes the assumption that all people are the same size, and that all instances of an object class are the same size. To validate this claim, we look at the SpatialSense dataset (Yang et al., 2019); specifically, at 6 objects that we hope are somewhat representative of the different ways people interact with objects: ball, book, car, dog, guitar, and table. These objects were picked over ones such as wall or floor, for which it is more ambiguous what counts as an interaction. We then hand-labeled the images where each of these objects cooccurs with a human as “yes” or “no” based on whether the person of interest is interacting with the object. We pick the threshold by optimizing for mean per-class accuracy, where every distance below the threshold is classified as a “yes” interaction and every distance above it as a “no” interaction. The threshold is picked on the same data for which the accuracy is reported.
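
For concreteness, a minimal sketch of this distance and threshold computation follows. It assumes (x, y, width, height) bounding boxes and hand-labeled interaction flags; the function and variable names are illustrative, not the tool's interface.

```python
# Minimal sketch (assumed inputs): normalized person-object distance and
# threshold selection by mean per-class accuracy over hand-labeled interactions.
import math

def normalized_distance(person_box, object_box):
    """dist = ||center_p - center_o|| / sqrt(area_p * area_o); boxes are (x, y, w, h)."""
    px, py, pw, ph = person_box
    ox, oy, ow, oh = object_box
    center_dist = math.hypot((px + pw / 2) - (ox + ow / 2),
                             (py + ph / 2) - (oy + oh / 2))
    return center_dist / math.sqrt(pw * ph * ow * oh)  # sqrt(area_p * area_o)

def pick_threshold(distances, is_interaction):
    """Choose the threshold that maximizes mean per-class accuracy, where distances
    at or below the threshold are classified as "yes" interactions.
    Assumes both "yes" and "no" labels are present."""
    pos = sum(is_interaction)
    neg = len(is_interaction) - pos
    best_t, best_acc = None, -1.0
    for t in sorted(set(distances)):  # candidate thresholds at observed distances
        pred_yes = [d <= t for d in distances]
        tp = sum(p and g for p, g in zip(pred_yes, is_interaction))
        tn = sum((not p) and (not g) for p, g in zip(pred_yes, is_interaction))
        acc = 0.5 * (tp / pos + tn / neg)  # mean per-class ("yes"/"no") accuracy
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```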

Table 4: Distances are classified as “yes” or “no” interaction based on a threshold optimized for mean per-class accuracy

As can be seen in Table 4, for all 6 categories the mean distance when someone is interacting with an object is lower than when someone is not. This matches our claim that distance, although imperfect, can serve as a proxy for interaction. From the distribution of distances visualized in Fig. 18, we can see that for certain objects like ball and table, which also have the lowest mean per-class accuracy, there is more overlap between the distances for “yes” interactions and “no” interactions. Intuitively, this makes some sense: a ball is an object that can be interacted with both from a distance and through direct contact, and for table, people in the labeled examples were often seated at a table without directly interacting with it.

1.3 Pairwise Queries

In Sect. 4.2, another claim we make is that pairwise queries of the form “[Desired Object] and [Suggested Query Term]” could allow dataset collectors to augment their dataset with the types of images they want. One of the examples we gave is that if one notices the images of airplane in their dataset are overrepresented in the larger sizes, our tool would recommend they make the query “airplane and surfboard” to augment their dataset, because based on the distribution of training samples, this combination is more likely than other kinds of queries to lead to images of smaller airplanes.

However, there are a few concerns with this approach. For one, certain queries might not return any search results. This is especially the case when the suggested query term is a scene category, such as indoor cultural: the query “pizza and indoor cultural” might not be very fruitful. To deal with this, we can replace the scene category, indoor cultural, with more specific scenes in that category, like classroom and conference, so that the query becomes something like “pizza and classroom”. When the suggested query term involves an object, there is another approach we can take. In datasets like PASCAL VOC (Everingham et al., 2010), the set of queries used to collect the dataset is given. For example, to get pictures of boat, the creators also queried for barge, ferry, and canoe. Thus, in addition to querying, for example, “airplane and boat”, one could also query for “airplane and ferry”, “airplane and barge”, etc.
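
As an illustration of this query expansion, a minimal sketch follows. The scene-group and object-synonym mappings shown are small illustrative excerpts (the boat synonyms come from the PASCAL VOC query list quoted above); they are not the tool's full lists.

```python
# Minimal sketch (illustrative mappings): expand a suggested pairwise query into
# concrete search-engine queries by substituting specific scenes for a scene group,
# or known query synonyms for an object.
SCENE_GROUP_TO_SCENES = {
    "indoor cultural": ["classroom", "conference"],  # example member scenes
}
OBJECT_TO_QUERY_SYNONYMS = {
    "boat": ["boat", "barge", "ferry", "canoe"],     # from the PASCAL VOC queries
}

def expand_pairwise_queries(desired_object, suggested_term):
    """Turn "[Desired Object] and [Suggested Query Term]" into a list of queries."""
    terms = (SCENE_GROUP_TO_SCENES.get(suggested_term)
             or OBJECT_TO_QUERY_SYNONYMS.get(suggested_term)
             or [suggested_term])
    return [f"{desired_object} and {term}" for term in terms]

print(expand_pairwise_queries("pizza", "indoor cultural"))
# ['pizza and classroom', 'pizza and conference']
print(expand_pairwise_queries("airplane", "boat"))
# ['airplane and boat', 'airplane and barge', 'airplane and ferry', 'airplane and canoe']
```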

Fig. 18: Distances for the objects that were hand-labeled, orange if there is an interaction and blue if there is not. The red vertical line is the threshold: everything below it is classified as “yes”, and everything above it as “no”.

Fig. 19: Screenshots of top results from performing queries on Flickr that satisfy the tags mentioned. For train, when it is queried with boat, the train itself is more likely to be farther away, and thus smaller. When queried with backpack, the image is more likely to show travelers right next to, or even inside of, a train, and thus show it more in the foreground. The same idea applies for pizza, which is imaged from further in the background when paired with an indoor cultural scene, and up close when paired with broccoli.

Fig. 20: Screenshots of top results from performing queries on Flickr that satisfy the tags mentioned. For bed, sink provides a context that makes it more likely to be imaged from further away, whereas cat brings bed to the forefront. The same holds when the object of interest is cat: a pairwise query with sheep makes it more likely to be farther away, and a pairwise query with suitcase makes it more likely to be closer.

Another concern is that the correlations observed in the dataset may differ from the correlations in the images returned for the queries. For example, just because cat and dog cooccur at a certain rate in the dataset does not necessarily mean they cooccur at the same rate in search engine results. However, our query recommendation rests on the assumptions that datasets are constructed by querying a search engine, and that objects cooccur at roughly the same relative rates in the dataset as they do in query returns; for example, because train tends to be small when it cooccurs with boat in our dataset, we assume train is also likely to be smaller in returned images that contain boat. We also assume that for an image containing a train and a boat, the query “train and boat” would recover this kind of image, but it could be that the actual query used to find the image was “coastal transit.” If we had access to the actual query used to find each image, the conditional probability could be calculated over the queries themselves rather than over object or scene cooccurrences. It is because we do not have these original queries that we use cooccurrences as a proxy for recovering them.
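
To make the role of cooccurrence statistics concrete, the sketch below ranks candidate query terms for a target object by how often the target appears small when the candidate cooccurs with it. The annotation structure and the "small" cutoff (below the target's median relative area) are assumptions for illustration; this mirrors the reasoning above rather than reproducing the tool's exact computation.

```python
# Minimal sketch (assumed annotation format): score each cooccurring object by the
# fraction of images in which the target object is "small" when they appear together.
from collections import defaultdict
from statistics import median

def rank_query_terms(images, target):
    """images: list of dicts like {"objects": {"train": 0.04, "boat": 0.20}},
    mapping object name to its relative area in the image.
    Returns cooccurring objects sorted by P(target is small | cooccurrence)."""
    target_areas = [img["objects"][target] for img in images if target in img["objects"]]
    if not target_areas:
        return []
    small_cutoff = median(target_areas)  # "small" = below the target's median relative area
    small = defaultdict(int)
    total = defaultdict(int)
    for img in images:
        objs = img["objects"]
        if target not in objs:
            continue
        is_small = objs[target] < small_cutoff
        for other in objs:
            if other != target:
                total[other] += 1
                small[other] += is_small
    return sorted(total, key=lambda o: small[o] / total[o], reverse=True)
```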

To gain some confidence in our use of these pairwise queries in place of the original queries, we show qualitative examples of the results when searching on Flickr for images that contain the tags of the object(s) searched. We show the results of querying for (1) just the object, (2) the object and a query term that we would hope leads to more of the object at a smaller size, and (3) the object and a query term that we would hope leads to more of the object at a bigger size. In Figs. 19 and 20 we show the resulting images sorted by relevance under the Creative Commons license. We can see that when we perform these pairwise queries, we do indeed have some level of control over the size of the object in the resulting images. For example, the “pizza and classroom” and “pizza and conference” queries (specific scenes swapped in for indoor cultural) return smaller pizzas than the “pizza and broccoli” query, which tends to feature bigger pizzas that take up the whole image. This could of course create other representation issues, such as a surplus of pizza and broccoli images, so it could be important to use more than one of the recommended queries our tool surfaces. Although this is an imperfect method, it is still a useful tactic that can be applied without access to the actual queries used to create the dataset (see Note 6).

About this article

Cite this article

Wang, A., Liu, A., Zhang, R. et al. REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets. Int J Comput Vis 130, 1790–1810 (2022). https://doi.org/10.1007/s11263-022-01625-5
