CDTD: A Large-Scale Cross-Domain Benchmark for Instance-Level Image-to-Image Translation and Domain Adaptive Object Detection


Cross-domain visual problems, such as image-to-image translation and domain adaptive object detection, have attracted increasing attentions in the last few years, and also become new rising and challenging directions for the computer vision community. Recently, despite enormous efforts of the field in data collection, there are still few datasets covering the instance-level image-to-image translation and domain adaptive object detection tasks simultaneously. In this work, we introduce a large-scale cross-domain benchmark CDTD (contains 155,529 high-resolution natural images across four different modalities with object bounding box annotations. A summary of the entire dataset is provided in the following sections. Dataset is available at: for the new instance-level translation and object detection tasks. We provide comprehensive baseline results of the benchmark on both of these two tasks. Moreover, we proposed a novel instance-level image-to-image translation approach called INIT and a gradient detach method for the domain adaptive object detection to harvest and exert dataset’s function of the instance level annotations across different domains.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17


  1. 1.

    The abbreviation of A C ross-D omain Benchmark for T ranslation and D etection tasks.

  2. 2.

    For safety, we collect the rainy images after the rain, so this category looks more like overcast weather with wet road.

  3. 3.

  4. 4.


  1. Almahairi, A., Rajeswar, S., Sordoni, A., Bachman, P., & Courville, A. (2018). Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML.

  2. Bai, Y., Zhang, Y., Ding, M., & Ghanem, B. (2018). Finding tiny faces in the wild with generative adversarial network. In CVPR.

  3. Cai, Q., Pan, Y., Ngo, C.W., Tian, X., Duan, L., & Yao, T. (2019). Exploring object relation in mean teacher for cross-domain detection. In CVPR.

  4. Chen, Y., Li, W., Sakaridis, C., Dai, D., & Van Gool, L. (2018). Domain adaptive faster r-cnn for object detection in the wild. In CVPR

  5. Cheung, B., Livezey, J. A., Bansal, A. K., & Olshausen, B.A. (2015). Discovering hidden factors of variation in deep networks. In ICLR workshop.

  6. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR.

  7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE.

  8. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). Centernet: Object detection with keypoint triplets. arXiv preprint arXiv:1904.08189.

  9. Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision,88(2), 303–338.

  10. Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In ICML.

  11. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., et al. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1), 2096–2030.

    MathSciNet  MATH  Google Scholar 

  12. Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR).

  13. Girshick, R. (2015). Fast R-CNN. In ICCV.

  14. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In NIPS.

  15. He, K., Girshick, R., & Dollár, P. (2019). Rethinking imagenet pre-training. In: Proceedings of the IEEE international conference on computer vision (pp. 4918–4927).

  16. He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In ICCV.

  17. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  18. He, Z., & Zhang, L. (2019). Multi-adversarial faster-rcnn for unrestricted object detection. In ICCV.

  19. Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P., Saenko, K., Efros, A., & Darrell, T. (2018). Cycada: Cycle-consistent adversarial domain adaptation. In ICML.

  20. Huang, X., & Belongie, S.J. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV.

  21. Huang, X., Liu, M.Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In ECCV.

  22. Iizuka, S., Simo-Serra, E., & Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4), 1–14.

    Article  Google Scholar 

  23. Inoue, N., Furuta, R., Yamasaki, T., & Aizawa, K. (2018). Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR.

  24. Isola, P., Zhu, J.Y., Zhou, T., & Efros, A.A. (2017). Image-to-image translation with conditional adversarial networks. In IEEE conference on computer vision and pattern recognition.

  25. Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S.N., Rosaen, K., & Vasudevan, R. (2016). Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv preprint arXiv:1610.01983.

  26. Karacan, L., Akata, Z., Erdem, A., & Erdem, E. (2016). Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215.

  27. Kim, T., Jeong, M., Kim, S., Choi, S., & Kim, C. (2019). Diversify and match: A domain adaptive representation learning paradigm for object detection. In CVPR.

  28. Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.

  29. Laffont, P. Y., Ren, Z., Tao, X., Qian, C., & Hays, J. (2014). Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics Proceedings of SIGGRAPH, 33(4), 1–11.

    Article  Google Scholar 

  30. Law, H., & Deng, J. (2018). Cornernet: Detecting objects as paired keypoints. In ECCV.

  31. Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M., & Yang, M.H. (2018). Diverse image-to-image translation via disentangled representations. In ECCV.

  32. Li, T., Qian, R., Dong, C., Liu, S., Yan, Q., Zhu, W., & Lin, L. (2018). Beautygan: Instance-level facial makeup transfer with deep generative adversarial network. In 2018 ACM multimedia conference on multimedia conference (pp. 645–653). ACM.

  33. Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., & Belongie, S.J. (2017). Feature pyramid networks for object detection. In CVPR.

  34. Lin, T.Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In ICCV.

  35. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In ECCV.

  36. Liu, M.Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. In NIPS.

  37. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., & Berg, A.C. (2016). SSD: Single shot multibox detector. In ECCV.

  38. Long, M., Zhu, H., Wang, J., & Jordan, M.I. (2016). Unsupervised domain adaptation with residual transfer networks. In Advances in neural information processing systems.

  39. Ma, S., Fu, J., Chen, C.W., & Mei, T. (2018). Da-gan: Instance-level image translation by deep attention generative adversarial networks. In CVPR.

  40. Maaten, L.v.d., & Hinton, G., (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9, 2579–2605.

  41. Mathieu, M.F., Zhao, J.J., Zhao, J., Ramesh, A., Sprechmann, P., & LeCun, Y. (2016). Disentangling factors of variation in deep representation using adversarial training. In NIPS

  42. Mechrez, R., Talmi, I., & Zelnik-Manor, L. (2018). The contextual loss for image transformation with non-aligned data. In Proceedings of the European conference on computer vision (ECCV), 768–783.

  43. Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

  44. Mo, S., Cho, M., & Shin, J. (2019). Instance-aware image-to-image translation. In International conference on learning representations.

  45. Murez, Z., Kolouri, S., Kriegman, D., Ramamoorthi, R., & Kim, K. (2018) Image to image translation for domain adaptation. In CVPR.

  46. Nguyen, V., Vicente, Y., Tomas, F., Zhao, M., Hoai, M., & Samaras, D. (2017). Shadow detection with conditional generative adversarial networks. In ICCV.

  47. Panareda Busto, P., & Gall, J. (2017). Open set domain adaptation. In ICCV.

  48. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in pytorch. In NIPS workshop.

  49. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2536–2544).

  50. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., & Wang, B. (2019). Moment matching for multi-source domain adaptation. In Proceedings of the IEEE international conference on computer vision (pp. 1406–1415).

  51. Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., & Saenko, K. (2017). Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924.

  52. Peng, X., Usman, B., Saito, K., Kaushik, N., Hoffman, J., & Saenko, K. (2018). Syn2real: A new benchmark forsynthetic-to-real visual domain adaptation. arXiv preprint arXiv:1806.09755.

  53. Radim Tyleček, R. Š. (2013). Spatial pattern templates for recognition of objects with regular structure. Saarbrucken, Germany: In Proceeding GCPR.

  54. Raj, A., Namboodiri, V. P., & Tuytelaars, T. (2015). Subspace alignment based domain adaptation for rcnn detector. arXiv preprint arXiv:1507.05578.

  55. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems.

  56. Saito, K., Ushiku, Y., Harada, T., & Saenko, K. (2019). Strong-weak distribution alignment for adaptive object detection. In CVPR.

  57. Sakaridis, C., Dai, D., & Van Gool, L. (2018). Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126(9), 973–992.

    Article  Google Scholar 

  58. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training gans. In NIPS.

  59. Sangkloy, P., Lu, J., Fang, C., Yu, F., & Hays, J. (2017). Scribbler: Controlling deep image synthesis with sketch and color. In CVPR.

  60. Shen, Z., He, Z., & Xue, X. (2019). Meal: Multi-model ensemble via adversarial learning. In AAAI.

  61. Shen, Z., Huang, M., Shi, J., Xue, X., & Huang, T. (2019). Towards instance-level image-to-image translation. In CVPR.

  62. Shen, Z., Liu, Z., Li, J., Jiang, Y.G., Chen, Y., & Xue, X. (2017). Dsod: Learning deeply supervised object detectors from scratch. In ICCV.

  63. Shen, Z., Liu, Z., Li, J., Jiang, Y. G., Chen, Y., & Xue, X. (2019). Object detection from scratch with deep supervision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 398–412.

    Article  Google Scholar 

  64. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In CVPR.

  65. Tzeng, E., Burns, K., Saenko, K., & Darrell, T. (2018). Splat: Semantic pixel-level adaptation transforms for detection. arXiv preprint arXiv:1812.00929.

  66. Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In CVPR.

  67. Venkateswara, H., Eusebio, J., Chakraborty, S., & Panchanathan, S. (2017). Deep hashing network for unsupervised domain adaptation. In (IEEE) conference on computer vision and pattern recognition (CVPR).

  68. Wang, X., Cai, Z., Gao, D., & Vasconcelos, N. (2019). Towards universal object detection by domain attention. In CVPR.

  69. Wu, Y., Winston, E., Kaushik, D., & Lipton, Z. (2019). Domain adaptation with asymmetrically-relaxed distribution alignment. In ICML.

  70. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., & Huang, T. S. (2018). Generative image inpainting with contextual attention. In The IEEE conference on computer vision and pattern recognition (CVPR).

  71. Zhang, R., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.

  72. Zhang, Z., Yang, L., & Zheng, Y. (2018). Translating and segmenting multimodal medical volumes with cycle-and shapeconsistency generative adversarial network. In CVPR.

  73. Zhao, H., Des Combes, R. T., Zhang, K., & Gordon, G. (2019). On learning invariant representations for domain adaptation. In ICML.

  74. Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In CVPR.

  75. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE international conference on computer vision (ICCV).

  76. Zhu, J. Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., & Shechtman, E. (2017). Toward multimodal image-to-image translation. In Advances in neural information Processing Systems.

  77. Zhu, R., Zhang, S., Wang, X., Wen, L., Shi, H., Bo, L., & Mei, T. (2019). Scratchdet: Training single-shot object detectors from scratch. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2268–2277).

Download references

Author information



Corresponding author

Correspondence to Zhiqiang Shen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Dengxin Dai.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Shen, Z., Huang, M., Shi, J. et al. CDTD: A Large-Scale Cross-Domain Benchmark for Instance-Level Image-to-Image Translation and Domain Adaptive Object Detection. Int J Comput Vis (2020).

Download citation


  • Cross-domain benchmark
  • Instance level image-to-image translation
  • Domain adaptive object detection