
Cover-based multiple book genre recognition using an improved multimodal network

  • Original Paper
  • Published:
International Journal on Document Analysis and Recognition (IJDAR)


Despite the idiom that one should not prejudge something by its outward appearance, we apply deep learning to test whether a book can be judged by its cover or, more precisely, by its title text and cover design. Classification was performed with three strategies: text only, image only, and combined text and image. State-of-the-art convolutional neural network (CNN) models were used to classify books from their cover images. Gram and squeeze-and-excitation (SE) layers served as attention units within these networks to learn optimal features and identify characteristics of the cover image; the Gram layer enabled more accurate multi-genre classification than the SE layer. Text-based classification was performed with word-based, character-based, and feature-engineering-based models. We designed the EXplicit interActive Network (EXAN), composed of context-relevant layers and multi-level attention layers, to learn features from book titles. For multimodal classification, we designed an improved fusion architecture that applies an attention mechanism between modalities. The disparity in convergence speed across modalities is addressed by pre-training each sub-network independently before end-to-end training of the full model. Two book cover datasets were used in this study. Results demonstrate that text-based classifiers are superior to image-based classifiers. The proposed multimodal network outperformed all other models on this task, with the highest accuracies of 69.09% and 38.12% on the Latin and Arabic book cover datasets, respectively. Similarly, the proposed EXAN surpassed existing text classification models, scoring the highest prediction rates of 65.20% and 33.8% on the Latin and Arabic book cover datasets.
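As a rough illustration (not the authors' implementation, whose layer shapes and training details are not given here), the two attention ingredients named in the abstract — channel rescaling in an SE block and a channel-wise Gram matrix over a feature map — can be sketched in NumPy. The weight matrices `w1` and `w2` and the reduction ratio `r` are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def gram_matrix(fmap):
    """Channel-wise Gram matrix of a (C, H, W) feature map: captures
    which channel pairs co-activate, a style/texture-like statistic."""
    c, h, w = fmap.shape
    f = fmap.reshape(c, h * w)
    return f @ f.T / (h * w)           # (C, C)

def se_block(fmap, w1, w2):
    """Squeeze-and-excitation channel attention over a (C, H, W) map."""
    z = fmap.mean(axis=(1, 2))          # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)         # excitation: FC + ReLU (reduced dim)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s))) # FC + sigmoid -> per-channel weight
    return fmap * s[:, None, None]      # rescale each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))      # toy feature map, C=8
r = 2                                   # reduction ratio (assumed)
w1 = rng.standard_normal((8 // r, 8)) * 0.1
w2 = rng.standard_normal((8, 8 // r)) * 0.1
y = se_block(x, w1, w2)
g = gram_matrix(x)
print(y.shape, g.shape)                 # (8, 4, 4) (8, 8)
```

In a full network these blocks sit after a convolutional stage; the SE branch reweights channels, while the Gram matrix can be fed to later layers as an extra correlation feature.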




Data availability

The data and code will be made publicly available.




The authors received no specific grant for this research from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations



Conceptualization, Assad Rasheed and Arif Iqbal Umar; methodology, Assad Rasheed and Syed Hamad Shirazi; software, Assad Rasheed and Zakir Khan; validation, Zakir Khan and Shahzad Ahmad; formal analysis, Assad Rasheed; investigation, Shahzad Ahmad; data curation, Assad Rasheed; draft preparation, Arif Iqbal Umar; review and editing, Syed Hamad Shirazi and Zakir Khan; supervision, Arif Iqbal Umar and Syed Hamad Shirazi.

Corresponding author

Correspondence to Assad Rasheed.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Rasheed, A., Umar, A.I., Shirazi, S.H. et al. Cover-based multiple book genre recognition using an improved multimodal network. IJDAR 26, 65–88 (2023).
