ICDAR 2023 Competition on Born Digital Video Text Question Answering

Yang, Zhibo; Song, Xiaoge; Song, Sibo; Lu, Tong; Bai, Xiang; Liu, Cheng-Lin; Huang, Fei; Yao, Cong

doi:10.1007/978-3-031-41679-8_30

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14188)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

This paper presents the final results of the ICDAR 2023 Competition on Born Digital Video Text Question Answering (BDVT-QA), which comprises two task tracks: 1) End-to-End Video Text Spotting, and 2) Video Text Question Answering. BDVT-QA aims to spot text and answer questions in born-digital videos. The competition introduces a new dataset of 1,000 video clips fully annotated with manually designed question/answer pairs, where the answers are based on the text captions shown in the video clips. A total of 23 final submissions were received. The top-3 results for each track are: 1) T1.1 - 57.53%, T1.2 - 53.3%, T1.3 - 52.35%, and 2) T2.1 - 31.2%, T2.2 - 28.84%, T2.3 - 21.19%. We summarize the submitted methods and provide an in-depth analysis. The paper also covers the dataset description, task definitions, and evaluation protocols. The dataset and the final ranking of submissions are publicly available on the challenge’s official website: https://tianchi.aliyun.com/specials/promotion/ICDAR_2023_Competition_on_Born_Digital_Video_Text_QA.

Z. Yang, X. Song and S. Song: equal contribution.

Acknowledgments

The authors express their gratitude to the Competition Chairs for their valuable input in organizing the competition and for their critical review of the competition report. This challenge is sponsored by Alibaba Group. This work is also supported by NSFC (62225603), NSFC (61672273) and NSFC (61832008).

Author information

Authors and Affiliations

Alibaba Group, Hangzhou, China
Zhibo Yang, Sibo Song, Fei Huang & Cong Yao
Nanjing University, Nanjing, China
Xiaoge Song & Tong Lu
Huazhong University of Science and Technology, Wuhan, China
Zhibo Yang & Xiang Bai
Institute of Automation of Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu

Corresponding author

Correspondence to Zhibo Yang.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

About this paper

Cite this paper

Yang, Z. et al. (2023). ICDAR 2023 Competition on Born Digital Video Text Question Answering. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_30

DOI: https://doi.org/10.1007/978-3-031-41679-8_30
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8
eBook Packages: Computer Science, Computer Science (R0)
