
Benchmarking Deep Learning Models for Classification of Book Covers

  • Original Research
  • Published in SN Computer Science

Abstract

Book covers usually provide a good depiction of a book’s content and its central idea. Classifying books into their respective genres usually involves subjectivity and contextuality. Book retrieval systems would greatly benefit from an automated framework able to classify a book’s genre from an image alone, especially for archival documents, where digitizing a complete book for indexing purposes is expensive. While various modalities are available (e.g., cover, title, author, abstract), benchmarking image-based classification systems that rely on minimal information is a particularly exciting problem due to recent advances in image-based deep learning and its applicability. A natural question therefore arises: can the problem of book classification be solved using only an image of the cover together with current state-of-the-art deep learning models? To answer this question, this paper makes a three-fold contribution. First, the publicly available book cover dataset comprising 57k book covers belonging to 30 different categories is thoroughly analyzed and corrected. Second, the paper benchmarks the performance of a battery of state-of-the-art image classification models on the task of book cover classification. Third, it uses explicit attention mechanisms to identify the regions the network focused on in order to make its prediction. All evaluations were performed on a subset of the aforementioned public book cover dataset. Analysis of the results revealed the inefficacy of even the most powerful models at solving this classification task. The obtained results make it evident that significant effort must still be devoted to solving this image-based classification task to a satisfactory level.
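To make the benchmarking setup concrete, the following is a minimal sketch, assuming a PyTorch pipeline, of how such an image-only genre classifier can be fine-tuned: an ImageNet-pretrained ResNet-50 (one of the architecture families typically included in such benchmarks) has its final layer replaced by a 30-way genre head and is trained on cover images. The directory layout, hyperparameters, and choice of backbone below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal fine-tuning sketch: adapt an ImageNet-pretrained ResNet-50 to
# 30 book-cover genres. The data path, hyperparameters, and backbone are
# illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Resize covers to the ImageNet input size and normalize with ImageNet stats.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical layout: covers sorted into data/train/<genre>/*.jpg, so
# ImageFolder derives the 30 genre labels from the sub-folder names.
train_set = datasets.ImageFolder("data/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

# Replace the 1000-way ImageNet head with a genre classifier.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

model.train()
for epoch in range(10):  # illustrative epoch count
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

A saliency- or attention-based visualization over the trained network can then highlight which cover regions (e.g., title text versus imagery) drove each prediction, in the spirit of the attention analysis described above.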


Notes

  1. https://github.com/adriano-lucieri/book-dataset


Acknowledgements

This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and partially supported by JSPS KAKENHI (Grant JP17H06100). We thank all members of the Deep Learning Competence Center at the DFKI for their comments and support.

Author information


Corresponding author

Correspondence to Adriano Lucieri.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Document Analysis and Recognition” guest edited by Michael Blumenstein, Seiichi Uchida and Cheng-Lin Liu.

The source code and the models are available at https://github.com/adriano-lucieri/BookCoverClassification.


About this article


Cite this article

Lucieri, A., Sabir, H., Siddiqui, S.A. et al. Benchmarking Deep Learning Models for Classification of Book Covers. SN COMPUT. SCI. 1, 139 (2020). https://doi.org/10.1007/s42979-020-00132-z

