Women Also Snowboard: Overcoming Bias in Captioning Models

  • Lisa Anne Hendricks
  • Kaylee Burns
  • Kate Saenko
  • Trevor Darrell
  • Anna Rohrbach
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)

Abstract

Most machine learning methods are known to capture and exploit biases of the training data. While some biases are beneficial for learning, others are harmful. Specifically, image captioning models tend to exaggerate biases present in training data (e.g., if a word is present in 60% of training sentences, it might be predicted in 70% of sentences at test time). Due to over-reliance on the learned prior and image context, this can lead to incorrect captions in domains where unbiased captions are desired, or required. In this work we investigate generation of gender-specific caption words (e.g., man, woman) based on the person’s appearance or the image context. We introduce a new Equalizer model that encourages equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present. The resulting model is forced to look at a person rather than use contextual cues to make a gender-specific prediction. The losses that comprise our model, the Appearance Confusion Loss and the Confident Loss, are general, and can be added to any description model in order to mitigate the impact of unwanted bias in a description dataset. Our proposed model has lower error than prior work when describing images with people and mentioning their gender, and more closely matches the ground truth ratio of sentences including women to sentences including men. Finally, we show that our model more often looks at people when predicting their gender (https://people.eecs.berkeley.edu/~lisa_anne/snowboard.html).
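The two losses named in the abstract lend themselves to a compact sketch. The following PyTorch-style snippet is a minimal, hypothetical illustration rather than the authors' implementation: the tensor names (p_woman, p_man, p_correct, p_incorrect, is_gender_step), the absolute-difference form of the confusion term, the ratio form of the confidence term, and the normalization are all assumptions made for illustration. Here p_woman and p_man denote the probability mass the caption model assigns to woman-words and man-words at each time step; the confusion term is computed on images with the person region occluded, while the confidence term is computed on the original images.

```python
import torch

def appearance_confusion_loss(p_woman, p_man, is_gender_step):
    # Hypothetical sketch: when the person is masked out, the model should have
    # no preference between gendered words, so penalize any gap between the
    # probability mass on woman-words and man-words at gendered time steps.
    confusion = torch.abs(p_woman - p_man)
    denom = torch.clamp(is_gender_step.sum(), min=1.0)
    return (is_gender_step * confusion).sum() / denom

def confident_loss(p_correct, p_incorrect, is_gender_step, eps=1e-8):
    # Hypothetical sketch: on the full image, encourage the probability of the
    # correct gendered word set to dominate the incorrect one; the penalty
    # shrinks toward zero as the correct set dominates.
    ratio = p_incorrect / (p_correct + eps)
    denom = torch.clamp(is_gender_step.sum(), min=1.0)
    return (is_gender_step * ratio).sum() / denom
```

Per the abstract, these terms are intended as additions to a standard captioning objective (e.g., weighted and summed with the usual cross-entropy loss), so they can be attached to any description model; the exact weighting is an assumption here.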

Keywords

Image description · Caption bias · Right for the right reasons

Acknowledgements

This work was partially supported by US DoD, the DARPA XAI program, and the Berkeley Artificial Intelligence Research (BAIR) Lab.

Supplementary material

Supplementary material 1: 474178_1_En_47_MOESM1_ESM.pdf (PDF, 1,057 KB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. UC Berkeley, Berkeley, USA
  2. Boston University, Boston, USA