Exploring the Limits of Weakly Supervised Pretraining

  • Dhruv Mahajan
  • Ross GirshickEmail author
  • Vignesh Ramanathan
  • Kaiming He
  • Manohar Paluri
  • Yixuan Li
  • Ashwin Bharambe
  • Laurens van der Maaten
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11206)


State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.



We thank Matthijs Douze, Aapo Kyrola, Andrew Dye, Jerry Pan, Kevin Wilfong, Martin Englund, and Zeki Yalniz for helpful discussions and support.

Supplementary material

474176_1_En_12_MOESM1_ESM.pdf (917 kb)
Supplementary material 1 (pdf 916 KB)


  1. 1.
    Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)Google Scholar
  2. 2.
    Donahue, J., et al.: DeCAF: a deep convolutional activation feature for generic visual recognition. arXiv:1310.1531 (2013)
  3. 3.
    Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. In: ECCV (2014)Google Scholar
  4. 4.
    He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)Google Scholar
  5. 5.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)Google Scholar
  6. 6.
    Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)Google Scholar
  7. 7.
    Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR (2017)Google Scholar
  8. 8.
    Papandreou, G., et al.: Towards accurate multi-person pose estimation in the wild. In: CVPR (2017)Google Scholar
  9. 9.
    Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)Google Scholar
  10. 10.
    Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV (2015)Google Scholar
  11. 11.
    Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)MathSciNetCrossRefGoogle Scholar
  12. 12.
    Agrawal, P., Girshick, R., Malik, J.: Analyzing the performance of multilayer neural networks for object recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014 Part VII. LNCS, vol. 8695, pp. 329–344. Springer, Cham (2014). Scholar
  13. 13.
    Huh, M., Agrawal, P., Efros, A.: What makes ImageNet good for transfer learning? arXiv:1608.08614 (2016)
  14. 14.
    Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. PAMI 40, 1452–1464 (2017)CrossRefGoogle Scholar
  15. 15.
    Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)Google Scholar
  16. 16.
    Joulin, A., van der Maaten, L., Jabri, A., Vasilache, N.: Learning visual features from large weakly supervised data. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016 Part VII. LNCS, vol. 9911, pp. 67–84. Springer, Cham (2016). Scholar
  17. 17.
    Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of ICCV (2017)Google Scholar
  18. 18.
    Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014 Part V. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). Scholar
  19. 19.
    Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. In: arXiv:1706.02677 (2017)
  20. 20.
    WordNet: About WordNet (2010).
  21. 21.
    Welinder, P., et al.: Caltech-UCSD Birds 200. Technical report, Caltech (2010)Google Scholar
  22. 22.
    Gordo, A., Almazan, J., Revaud, J., Larlus, D.: Deep image retrieval: learning global representations for image search. In: arXiv:1604.01325 (2016)CrossRefGoogle Scholar
  23. 23.
    Tolias, G., Sicre, R., Jegou, H.: Particular object retrieval with integral max-pooling of CNN activations. In: ICLR (2016)Google Scholar
  24. 24.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  25. 25.
    Huang, G., Liu, Z., Weinberger, K., van der Maaten, L.: Densely connected convolutional networks. In: CVPR. (2017)Google Scholar
  26. 26.
    Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-V4, inception-ResNet and the impact of residual connections on learning. In: arXiv:1602.07261 (2016)
  27. 27.
    Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)Google Scholar
  28. 28.
    He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV (2015)Google Scholar
  29. 29.
    Pathak, D., Girshick, R., Dollár, P., Darrell, T., Hariharan, B.: Learning features by watching objects move. In: CVPR (2017)Google Scholar
  30. 30.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  31. 31.
    Zoph, B., Vasudevan, V., Shlens, J., Le, Q.: Learning transferable architectures for scalable image recognition. In: arXiv:1707.07012 (2017)
  32. 32.
    Stock, P., Cisse, M.: Convnets and ImageNet beyond accuracy: explanations, bias detection, adversarial examples and model criticism. In: arXiv:1711.11443 (2017)
  33. 33.
    Izadinia, H., Russell, B., Farhadi, A., Hoffman, M., Hertzmann, A.: Deep classifiers from image tags in the wild. In: Multimedia COMMONS (2015)Google Scholar
  34. 34.
    Misra, I., Zitnick, C.L., Mitchell, M., Girshick, R.: Seeing through the human reporting bias: visual classifiers from noisy human-centric labels. In: CVPR (2016)Google Scholar
  35. 35.
    Veit, A., Nickel, M., Belongie, S., van der Maaten, L.: Separating self-expression and visual content in hashtag supervision. In: arXiv 1711.09825 (2017)Google Scholar
  36. 36.
    Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)Google Scholar
  37. 37.
    Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., He, K.: Detectron (2018).
  38. 38.
    Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)Google Scholar
  39. 39.
    Li, A., Jabri, A., Joulin, A., van der Maaten, L.: Learning visual n-grams from web data. In: Proceedings of ICCV (2017)Google Scholar
  40. 40.
    Thomee, B., et al.: YFCC100M: the new data in multimedia research. Commun. ACM 59(2), 64–73 (2016)CrossRefGoogle Scholar
  41. 41.
    Gross, S., Ranzato, M., Szlam, A.: Hard mixtures of experts for large scale weakly supervised vision. In: CVPR (2017)Google Scholar
  42. 42.
    Denton, E., Weston, J., Paluri, M., Bourdev, L., Fergus, R.: User conditional hashtag prediction for images. In: Proceedings of KDD, pp. 1731–1740 (2015)Google Scholar
  43. 43.
    Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: CVPR (2015)Google Scholar
  44. 44.
    Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Web-scale training for face identification. In: CVPR (2015)Google Scholar
  45. 45.
    Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
  46. 46.
    Stewénius, H., Gunderson, S.H., Pilet, J.: Size matters: exhaustive geometric verification for image retrieval accepted for ECCV 2012. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012 Part II. LNCS, pp. 674–687. Springer, Heidelberg (2012). Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Dhruv Mahajan
    • 1
  • Ross Girshick
    • 1
    Email author
  • Vignesh Ramanathan
    • 1
  • Kaiming He
    • 1
  • Manohar Paluri
    • 1
  • Yixuan Li
    • 1
  • Ashwin Bharambe
    • 1
  • Laurens van der Maaten
    • 1
  1. 1.FacebookMenlo ParkUSA

Personalised recommendations