
CNN-based segmentation of speech balloons and narrative text boxes from comic book page images

  • Special Issue Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Most recent research on comic document images has focused on the digital reading and distribution of comics, driven by advances in technology. This work addresses the extraction of narrative text boxes and speech balloons, which carry the conversations among comic characters along with their feelings. Because drawing styles vary widely, speech balloons take complex shapes and are difficult to extract. We present a shape-aware dual-stream convolutional neural network for segmenting narrative text boxes and speech balloons of various shapes. In this architecture, an added shape module processes edge information of the speech balloons and narrative text boxes alongside the main module, and the outputs of the two modules are then concatenated to produce a more accurate segmentation. The proposed method achieves significant improvements in both region accuracy (mIoU) and boundary accuracy (F-measure and Hausdorff distance) over other state-of-the-art methods on several publicly available comic datasets in different languages, namely eBDtheque, DCM, and a subset of Manga109. In addition, we have developed a new dataset (BCBId) for comics in Bangla, the eighth most spoken language in the world, and propose a semiautomatic method for producing ground-truth images.
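The three evaluation measures named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' evaluation code: the function names, the pixel tolerance `tol`, and the convention that two empty masks score an IoU of 1 are all assumptions made for illustration, and the paper's exact definitions (for example, how boundary points are extracted or whether a modified Hausdorff distance is used) may differ.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two boolean masks of equal shape."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks match perfectly
    return float(np.logical_and(pred, target).sum() / union)


def mean_iou(pred_labels, target_labels, num_classes):
    """Region accuracy: per-class IoU averaged over integer label maps."""
    pred_labels = np.asarray(pred_labels)
    target_labels = np.asarray(target_labels)
    scores = [iou(pred_labels == c, target_labels == c)
              for c in range(num_classes)]
    return float(np.mean(scores))


def _pairwise(points_a, points_b):
    """Euclidean distance matrix between two (N, 2) and (M, 2) point sets."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two boundary point sets."""
    d = _pairwise(points_a, points_b)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


def boundary_f_measure(pred_pts, gt_pts, tol=1.0):
    """F-measure over boundary points matched within `tol` pixels (assumed)."""
    d = _pairwise(pred_pts, gt_pts)
    precision = float((d.min(axis=1) <= tol).mean())  # predicted points near GT
    recall = float((d.min(axis=0) <= tol).mean())     # GT points near prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy 4x4 page: class 1 stands in for "balloon", class 0 for background.
    target = np.zeros((4, 4), dtype=int)
    target[1:3, 1:3] = 1
    pred = np.zeros((4, 4), dtype=int)
    pred[1:3, 1:4] = 1  # prediction spills one column to the right
    print("mIoU:", mean_iou(pred, target, num_classes=2))

    pred_boundary = [(0, 0), (0, 1)]
    gt_boundary = [(0, 0), (0, 4)]
    print("Hausdorff:", hausdorff(pred_boundary, gt_boundary))
    print("F-measure:", boundary_f_measure(pred_boundary, gt_boundary))
```

The brute-force distance matrix keeps the sketch short but grows with the product of the two point counts, so full-resolution page boundaries would likely need a spatial index or a distance transform instead.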


Notes

  1. Code and data are available at https://github.com/Arpi07/Arpi07-2/tree/Speech_balloon_segmentation.


Author information

Correspondence to Arpita Dutta.

About this article

Cite this article

Dutta, A., Biswas, S. & Das, A.K. CNN-based segmentation of speech balloons and narrative text boxes from comic book page images. IJDAR 24, 49–62 (2021). https://doi.org/10.1007/s10032-021-00366-4


  • DOI: https://doi.org/10.1007/s10032-021-00366-4
