MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Published in: International Journal of Computer Vision

Abstract

Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on arbitrary combinations of modalities, but they require every condition to be satisfied exactly, which hinders synthesis controllability and leaves the potential of cross-modality complementarity under-exploited. To this end, we propose to generate images conditioned on compositions of multimodal control signals in which the modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS). Specifically, we observe two challenging issues in the proposed CMCIS task: the modality coordination problem and the modality imbalance problem. To tackle these issues, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss that stabilizes the optimization of each modality, and a multimodal sampling guidance that balances the strength of each modality's control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis and MCIS tasks, producing high-quality and faithful images under complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT.
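
As a minimal sketch of the balanced-training idea mentioned above, the snippet below rescales each modality's loss by a running estimate of its magnitude so that no single modality dominates the gradients; the class name, momentum value, and rescaling rule are illustrative assumptions, not the exact multimodal balanced training loss used in the paper (adaptive fusion and sampling guidance are sketched in the appendix).

```python
import torch

class BalancedMultimodalLoss:
    """Illustrative sketch only: rescale each modality's loss by a running
    estimate of its magnitude so no single modality dominates the gradient.
    A generic balancing scheme assumed for illustration, not the exact
    multimodal balanced training loss of MMoT."""

    def __init__(self, modalities, momentum: float = 0.9):
        self.momentum = momentum
        self.running = {m: 1.0 for m in modalities}

    def __call__(self, losses: dict) -> torch.Tensor:
        total = torch.zeros(())
        for name, loss in losses.items():
            # Exponential moving average of this modality's loss magnitude.
            self.running[name] = (self.momentum * self.running[name]
                                  + (1.0 - self.momentum) * float(loss.detach()))
            # Dividing by the running magnitude keeps contributions comparable.
            total = total + loss / (self.running[name] + 1e-8)
        return total

# Usage with hypothetical per-modality token-prediction losses.
criterion = BalancedMultimodalLoss(["text", "segmentation", "sketch"])
losses = {"text": torch.tensor(2.3, requires_grad=True),
          "segmentation": torch.tensor(0.7, requires_grad=True),
          "sketch": torch.tensor(1.1, requires_grad=True)}
criterion(losses).backward()
```

A scheme of this kind keeps a dominant modality (e.g., segmentation masks) from drowning out a weaker one (e.g., text) during optimization, which is the intuition behind stabilizing each modality's training.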

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 62076101, the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515010007, and the Guangdong Provincial Key Laboratory of Human Digital Twin under Grant 2022B1212010004.

Author information

Corresponding authors

Correspondence to Chaoyue Wang or Changxing Ding.

Additional information

Communicated by Jiri Matas.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

More Qualitative Results

1.1 Qualitative Comparisons with UCIS Models

Fig. 12 Additional qualitative comparison of text-to-image synthesis on COCO-Stuff

Fig. 13 Additional qualitative comparison of segmentation mask-to-image synthesis on COCO-Stuff

Fig. 14 Additional qualitative comparison of sketch-to-image synthesis on COCO-Stuff

Fig. 15 Additional qualitative comparison of bounding boxes-to-image synthesis on COCO-Stuff

In Figs. 12, 13, 14, and 15, we show additional qualitative comparisons with a wide range of UCIS models conditioned on text, segmentation mask, sketch, and bounding boxes, respectively. The competitive visual results against UCIS models specifically designed for a single modality indicate that MMoT is robust across different modalities.

1.2 Qualitative CMCIS Examples

Fig. 16 Examples of composed multimodal conditional image synthesis when conditioned on text and segmentation mask. From left to right: text, segmentation, a random sample from PoE-GAN, and two random samples from our MMoT. PoE-GAN always struggles with the modality imbalance problem. In contrast, MMoT can balance the information of the two modalities to synthesize images

Fig. 17 Examples of composed multimodal conditional image synthesis when conditioned on text and sketch. From left to right: text, sketch, a random sample from PoE-GAN, and two random samples from our MMoT. PoE-GAN always struggles with the modality imbalance problem. In contrast, MMoT can balance the information of the two modalities to synthesize images

Fig. 18 Examples of composed multimodal conditional image synthesis when conditioned on segmentation and sketch. From left to right: segmentation mask, sketch, a random sample from PoE-GAN, and three random samples from our MMoT. PoE-GAN always struggles with the modality coordination problem. In contrast, MMoT can generate more spatially coordinated images

Fig. 19 Examples of composed multimodal conditional image synthesis. We show three random samples from MMoT conditioned on compositions of different modalities (from top to bottom: text+segmentation mask, text+sketch, and text+bounding boxes)

Fig. 20 Examples of composed multimodal conditional image synthesis. We show three random samples from MMoT conditioned on compositions of different modalities (from top to bottom: segmentation mask+sketch, bounding boxes+sketch, and segmentation mask+sketch+bounding boxes)

Figures 16, 17, 18, 19, and 20 show that MMoT can generate high-quality, faithful, and diverse images when conditioned on complex compositions of two or three different modalities.

In Figs. 16, 17, and 18, we also show additional visual comparisons with PoE-GAN conditioned on compositions of text+segmentation mask, text+sketch, and segmentation mask+sketch, respectively. The modality coordination problem and the modality imbalance problem are common in MCIS models under complex multimodal conditions. In contrast, MMoT addresses both issues and synthesizes high-quality and faithful images.

Modality coordination problem. The modality coordination problem is caused by the non-adaptive fusion of fine-grained information across multiple modalities. As illustrated in Fig. 18, when PoE-GAN synthesizes an image, the content generated from the sketch condition is incorrectly composed with the content generated from the segmentation mask condition.
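
To make the contrast with non-adaptive fusion concrete, the sketch below fuses per-token features from several modality encoders with a learned, token-wise softmax gate, so each spatial position can weight the sketch and segmentation signals differently; the module name, tensor shapes, and gating rule are illustrative assumptions rather than the exact MMoT fusion mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLevelModalityFusion(nn.Module):
    """Illustrative sketch only: fuse per-token features from several modality
    encoders with a learned, token-wise softmax gate. Names, shapes, and the
    gating rule are assumptions, not the exact MMoT fusion module."""

    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, modality_feats):
        # modality_feats: list of [batch, num_tokens, dim] tensors, one per modality.
        stacked = torch.stack(modality_feats, dim=2)             # [B, T, M, D]
        weights = F.softmax(self.gate(stacked.flatten(2)), -1)   # [B, T, M]
        # Each image token mixes the modalities with its own weights, so a
        # sketch can dominate some positions while the mask dominates others.
        return (weights.unsqueeze(-1) * stacked).sum(dim=2)      # [B, T, D]

# Usage: fuse sketch and segmentation features for 256 image tokens.
fusion = TokenLevelModalityFusion(dim=512, num_modalities=2)
sketch_feats = torch.randn(1, 256, 512)
seg_feats = torch.randn(1, 256, 512)
fused = fusion([sketch_feats, seg_feats])                        # [1, 256, 512]
```

Because the mixture weights are computed per token, a conflict between modalities at one location no longer forces a global trade-off, which is the intuition behind adaptive fine-grained fusion.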

Modality imbalance problem. The modality imbalance problem is caused by the imbalanced distribution of the different modalities in the training data. As illustrated in Figs. 16 and 17, PoE-GAN tends to ignore the text input when generating images.
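
Imbalance can also be counteracted at sampling time by weighting each condition explicitly. The sketch below combines unconditional logits with per-modality conditional logits under separate guidance scales, in the spirit of classifier-free guidance; the function name, scale values, and combination rule are assumptions for illustration, not necessarily the paper's exact multimodal sampling guidance.

```python
import torch

def multimodal_guided_logits(logits_uncond, logits_per_modality, scales):
    """Illustrative sketch only: combine unconditional logits with per-modality
    conditional logits under separate guidance scales, in the spirit of
    classifier-free guidance. Not necessarily the exact MMoT formulation."""
    guided = logits_uncond.clone()
    for logits_cond, scale in zip(logits_per_modality, scales):
        # A larger scale strengthens that modality's influence on sampling.
        guided = guided + scale * (logits_cond - logits_uncond)
    return guided

# Usage: upweight text (often the weaker signal) relative to the sketch.
vocab_size = 1024
l_uncond = torch.randn(1, vocab_size)
l_text = torch.randn(1, vocab_size)
l_sketch = torch.randn(1, vocab_size)
logits = multimodal_guided_logits(l_uncond, [l_text, l_sketch], scales=[3.0, 1.5])
next_token_probs = torch.softmax(logits, dim=-1)
```

Raising the scale of an under-attended modality, such as text, increases its influence on the sampled tokens without retraining the model.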

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zheng, J., Liu, D., Wang, C. et al. MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02044-4
