An R-CNN Based Method to Localize Speech Balloons in Comics

  • Conference paper
MultiMedia Modeling (MMM 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9516)

Comic books enjoy great popularity around the world, and more and more people read them on digital devices, especially mobile ones. However, the screens of most mobile devices are too small to display an entire comic page legibly. Without reflow or adaptation of the original pages, readers often find the text on comic pages hard to read on mobile devices. Because the text on a comic page usually appears inside speech balloons, knowing the positions of the balloons makes it much easier to process the text further so that it reads well on mobile devices. It is therefore important to devise an effective method to localize speech balloons in comics, yet only a few studies have addressed this problem. In this paper, we propose a Regions with Convolutional Neural Network (R-CNN) based method to localize speech balloons in comics. Experimental results demonstrate that the proposed method localizes speech balloons in comics effectively and accurately.



This work is supported by the National Natural Science Foundation of China (Grant 61300061) and the Beijing Natural Science Foundation (4132033).

Corresponding author

Correspondence to Xicheng Liu.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, Y., Liu, X., Tang, Z. (2016). An R-CNN Based Method to Localize Speech Balloons in Comics. In: Tian, Q., Sebe, N., Qi, GJ., Huet, B., Hong, R., Liu, X. (eds) MultiMedia Modeling. MMM 2016. Lecture Notes in Computer Science, vol 9516. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27670-0

  • Online ISBN: 978-3-319-27671-7

  • eBook Packages: Computer Science, Computer Science (R0)
