CDTD: A Large-Scale Cross-Domain Benchmark for Instance-Level Image-to-Image Translation and Domain Adaptive Object Detection

Shen, Zhiqiang; Huang, Mingyang; Shi, Jianping; Liu, Zechun; Maheshwari, Harsh; Zheng, Yutong; Xue, Xiangyang; Savvides, Marios; Huang, Thomas S.

doi:10.1007/s11263-020-01394-z

Published: 24 November 2020

Volume 129, pages 761–780, (2021)

International Journal of Computer Vision

Zhiqiang Shen ORCID: orcid.org/0000-0002-4560-5092¹,
Mingyang Huang²,
Jianping Shi²,
Zechun Liu¹,
Harsh Maheshwari¹,
Yutong Zheng¹,
Xiangyang Xue³,
Marios Savvides¹ &
Thomas S. Huang⁴

Abstract

Cross-domain visual problems, such as image-to-image translation and domain adaptive object detection, have attracted increasing attention in the last few years and have become new, challenging directions for the computer vision community. Despite enormous data-collection efforts in the field, there are still few datasets that cover the instance-level image-to-image translation and domain adaptive object detection tasks simultaneously. In this work, we introduce a large-scale cross-domain benchmark, CDTD, for the new instance-level translation and object detection tasks. CDTD contains 155,529 high-resolution natural images across four different modalities with object bounding box annotations. A summary of the entire dataset is provided in the following sections, and the dataset is available at http://zhiqiangshen.com/projects/INIT/index.html. We provide comprehensive baseline results of the benchmark on both tasks. Moreover, we propose a novel instance-level image-to-image translation approach called INIT and a gradient detach method for domain adaptive object detection, both of which exploit the dataset's instance-level annotations across different domains.

Notes

The abbreviation of "A Cross-Domain Benchmark for Translation and Detection tasks".
For safety, we collected the rainy images after the rain had stopped, so this category looks more like overcast weather with wet roads.
https://github.com/NVlabs/MUNIT.
https://github.com/facebookresearch/maskrcnn-benchmark.

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, USA
Zhiqiang Shen, Zechun Liu, Harsh Maheshwari, Yutong Zheng & Marios Savvides
SenseTime Research, Beijing, China
Mingyang Huang & Jianping Shi
Fudan University, Shanghai, China
Xiangyang Xue
University of Illinois at Urbana-Champaign, Champaign, USA
Thomas S. Huang

Corresponding author

Correspondence to Zhiqiang Shen.

Additional information

Communicated by Dengxin Dai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shen, Z., Huang, M., Shi, J. et al. CDTD: A Large-Scale Cross-Domain Benchmark for Instance-Level Image-to-Image Translation and Domain Adaptive Object Detection. Int J Comput Vis 129, 761–780 (2021). https://doi.org/10.1007/s11263-020-01394-z

Received: 15 March 2020
Accepted: 14 October 2020
Published: 24 November 2020
Issue Date: March 2021
DOI: https://doi.org/10.1007/s11263-020-01394-z
