MultiMAE: Multi-modal Multi-task Masked Autoencoders

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE). It differs from standard Masked Autoencoding in two key aspects: (I) it can optionally accept additional modalities of information in the input besides the RGB image (hence “multi-modal”), and (II) its training objective accordingly includes predicting multiple outputs besides the RGB image (hence “multi-task”). We make use of masking (across image patches and input modalities) both to make training MultiMAE tractable and to ensure that the network indeed learns cross-modal predictive coding. We show that this pre-training strategy leads to a flexible, simple, and efficient framework with improved transfer results on downstream tasks. In particular, the exact same pre-trained network can be used flexibly whether or not additional information besides RGB images is available, in all configurations yielding results competitive with or significantly better than the baselines. To avoid needing training datasets with multiple modalities and tasks, we train MultiMAE entirely using pseudo labeling, which makes the framework widely applicable to any RGB dataset. The experiments are performed on multiple transfer tasks (image classification, semantic segmentation, depth estimation) and datasets (ImageNet, ADE20K, Taskonomy, Hypersim, NYUv2), and the results show an intriguingly strong capability of the model in cross-modal/task predictive coding and transfer. Code, pre-trained models, and interactive visualizations are available at https://multimae.epfl.ch.
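
To make the strategy above concrete, the following is a minimal, hypothetical PyTorch sketch of multi-modal masked autoencoding: flattened patch tokens from several (possibly pseudo-labeled) modalities are jointly sub-sampled, the visible tokens are processed by a shared Transformer encoder, and each modality is reconstructed by its own shallow head. All class names, dimensions, and the simplified linear decoders here are illustrative assumptions, not the authors' implementation (the official code is available at https://multimae.epfl.ch).

import torch
import torch.nn as nn


class TinyMultiMAE(nn.Module):
    """Toy multi-modal masked autoencoder; illustrative only, not the official MultiMAE."""

    def __init__(self, patch_dims, num_patches=196, dim=256, depth=4, num_heads=8):
        super().__init__()
        self.patch_dims = patch_dims          # e.g. {"rgb": 768, "depth": 256, "semseg": 64}
        self.num_patches = num_patches
        # Per-modality linear patch projections, a shared positional embedding,
        # and a learned modality embedding so the encoder can tell tokens apart.
        self.proj = nn.ModuleDict({m: nn.Linear(d, dim) for m, d in patch_dims.items()})
        self.pos = nn.Parameter(torch.randn(num_patches, dim) * 0.02)
        self.mod = nn.ParameterDict({m: nn.Parameter(torch.randn(dim) * 0.02)
                                     for m in patch_dims})
        # A single Transformer encoder processes the visible tokens of all modalities.
        layer = nn.TransformerEncoderLayer(dim, num_heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Shallow per-task readout: shared mask token + linear reconstruction head.
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.readout = nn.ModuleDict({m: nn.Linear(dim, d) for m, d in patch_dims.items()})

    def forward(self, patches, num_visible=98):
        """patches: dict of modality -> (B, num_patches, patch_dim) flattened patches."""
        B = next(iter(patches.values())).shape[0]
        # Embed every patch of every modality, then keep a random subset of tokens
        # sampled jointly across modalities (masking across patches AND modalities).
        tokens = torch.cat([self.proj[m](x) + self.pos + self.mod[m]
                            for m, x in patches.items()], dim=1)      # (B, M*P, dim)
        total = tokens.shape[1]
        keep = torch.rand(B, total, device=tokens.device).argsort(dim=1)[:, :num_visible]
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        encoded = self.encoder(visible)

        # Scatter the encoded tokens back into a full sequence of mask tokens
        # and reconstruct every patch of every modality with its own head.
        full = self.mask_token.expand(B, total, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, full.shape[-1]), encoded)
        out, start = {}, 0
        for m in patches:
            out[m] = self.readout[m](full[:, start:start + self.num_patches])
            start += self.num_patches
        return out


# Usage: reconstruction losses (e.g. L1/L2 for RGB and depth, cross-entropy for
# pseudo-labeled segmentation) would be applied to the masked positions only.
dims = {"rgb": 768, "depth": 256, "semseg": 64}
model = TinyMultiMAE(dims)
batch = {m: torch.randn(2, 196, d) for m, d in dims.items()}
reconstructions = model(batch)    # dict: modality -> (2, 196, patch_dim)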

R. Bachmann and D. Mizrahi—Equal contribution.

Acknowledgments

We thank Stefan Stepanovic and Alexander Sax for their help and insightful discussions.

Author information

Corresponding author

Correspondence to Roman Bachmann.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 15028 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bachmann, R., Mizrahi, D., Atanov, A., Zamir, A. (2022). MultiMAE: Multi-modal Multi-task Masked Autoencoders. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_20

  • DOI: https://doi.org/10.1007/978-3-031-19836-6_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19835-9

  • Online ISBN: 978-3-031-19836-6

  • eBook Packages: Computer Science (R0)
