
Benchmarking Deep Learning Models for Classification of Book Covers


Book covers usually provide a good depiction of a book’s content and its central idea. The classification of books into their respective genres usually involves subjectivity and contextuality. Book retrieval systems would greatly benefit from an automated framework that is able to classify a book’s genre based on an image, specifically for archival documents where digitization of the complete book for the purpose of indexing is an expensive task. While various modalities are available (e.g., cover, title, author, abstract), benchmarking image-based classification systems that rely on minimal information is a particularly exciting field due to the recent advancements in the domain of image-based deep learning and its applicability. A natural question therefore arises regarding the plausibility of solving the problem of book classification by utilizing only an image of the cover along with the current state-of-the-art deep learning models. To answer this question, this paper makes a three-fold contribution. First, the publicly available book cover dataset comprising 57k book covers belonging to 30 different categories is thoroughly analyzed and corrected. Second, a battery of state-of-the-art image classification models is benchmarked on the task of book cover classification. Third, explicit attention mechanisms are used to identify the regions that the network focused on in order to make its predictions. All of our evaluations were performed on a subset of the mentioned public book cover dataset. Analysis of the results revealed the inefficacy of even the most powerful models for solving this classification task. With the obtained results, it is evident that significant efforts need to be devoted to solving this image-based classification task to a satisfactory level.


Books have been the most prevalent medium for imparting knowledge for the past few centuries. Book covers provide the first impression of a book’s content, subject and central idea. This information is depicted by a combination of visual and textual information [14]. However, this visual interpretation is subjective and varies from person to person depending on their background and perspective. This makes the interpretation of book covers based on visual content alone extremely challenging, even for humans. Figure 1 shows some randomly selected book covers after blurring out the text. The lack of textual information makes it hard to guess the correct category for these covers, even though these particular books are comparatively visually descriptive. Very abstract or plain backgrounds are also very common in book covers, making the task almost impossible to solve without textual aid [16]. Therefore, it is particularly interesting to analyze the efficacy of state-of-the-art image classification models for the identification of book cover genres based on just an image of the book cover. The ability to automatically categorize and classify book covers without explicit human intervention could significantly improve the performance of current-generation book retrieval systems. Relying only on the book cover image is a significantly harder problem compared to explicitly taking the whole textual content into account, while being better suited for end-users.

Fig. 1

Book covers containing no textual cues highlighting the difficulty of the task solely relying on the visual content

Deep learning has been applied to a wide variety of problems since its resurgence in 2012, when Krizhevsky et al.  [21] reduced the error rate by almost half on a standard image classification benchmark challenge, comprising millions of images [33], by employing a deep model. These applications include image classification [9, 21, 37], image synthesis [8], image captioning [2], semantic segmentation [5], voice recognition [6], audio synthesis [39], document classification and understanding [1, 40], as well as playing Atari games [26].

With these advances in the domain of computer vision, an implicit assumption is made regarding the ability to directly solve the problem of book cover classification by employing the state-of-the-art image classification models. Therefore, we try to answer this question, by employing the most powerful image recognition models (NASNet, SE-ResNeXt-50, SE-ResNet-50, Inception ResNet v2, DenseNet-161, ResNet-152, ResNet-50 and VGG-16) to date in order to automatically classify these book covers. Finally, we also consider the impact of employing textual information along with the visual modality in order to quantify the gains by using this visual representation.

The main contributions of this paper are threefold:

  1. Detailed insights into the book cover classification dataset introduced by Iwana et al.  [14] and its complexity.

  2. A detailed evaluation of state-of-the-art classification models for the task of book cover classification. This helps establish a benchmark on this problem. Furthermore, it outlines the challenges due to which even the state-of-the-art models fail to solve this problem.

  3. Identification of the regions of the input that the network focused on in order to make its prediction, by equipping the model with an explicit attention mechanism. This attention mechanism also helps the network attain minor improvements in the computed metrics.

The rest of the paper is structured as follows. We first present a brief recapitulation of previous work on book cover classification in “Related Works” section. We then provide details of the dataset used in this study in “Dataset” section. State-of-the-art deep learning models for image classification are evaluated on the task of book cover classification in “State-of-the-Art Models for Book Cover Classification” section, where the dataset distribution and associated challenges are also analyzed in depth. The impact of dataset cleansing on classification accuracy is quantified in “Cleansed Dataset Evaluation” section, followed by extensive tweaking of the model architecture to unveil the task’s difficulties in “Extensive Model Tweaking” section. Finally, concluding remarks are given in “Conclusion” section.

Related Works

The classification of artistic book covers is a sophisticated task due to its subjectivity and the fuzzy nature of class affiliation. However, exploration and analysis of underlying patterns can reveal interesting coherencies that could be useful both for understanding art and for aiding artists as inspirational influences.

In the field of book genre classification, Iwana et al.  [14] proposed a publicly available dataset of book covers comprising 30 categories. They attempted genre classification using the LeNet [22] and AlexNet [21] architectures with the book cover image as input and achieved a baseline accuracy of 24.5%. Kjartansson and Ashavsky  [20] approached a subset of the dataset from [14]. They chose only ten of the original 30 categories, reducing the dataset to 19k samples. By applying several image-based and text-based approaches, they achieved higher accuracies on this subset than Iwana et al.  [14] did on the original dataset. For their image-based approaches, older architectures like VGG-16 [34], SqueezeNet [13] and ResNet-50 [9] were used. Buczkowski et al.  [4, 35] similarly approached genre classification using book covers and short textual descriptions. They focused on a completely different crawled dataset comprising 14 categories. On this dataset, they evaluated a simple convolutional neural network (CNN) architecture as well as a VGG-like architecture. Recently, Jolly et al.  [16] applied the layer-wise relevance propagation (LRP) [3] method to a trained CNN [22] for book cover classification to explain the cover image’s pixel-wise contributions and spot the most relevant elements of the artworks for genre classification.

Other types of genre classification have also been a focus of research. Oramas et al.  [30, 31] performed genre classification of album covers using textual review information as well as a multi-modal approach, where the textual, acoustic and visual information of music albums was leveraged. Libeks and Turnbull  [24] annotated music albums with genre tags using only the album cover artwork and promotional photographs. Similarly, the classification of painting styles has been studied in [18, 43].


Dataset

We used the publicly available dataset of book covers proposed by Iwana et al.  [14]. The raw dataset contains information regarding the book cover title, authors, main category, multiple subcategories and links to the images of over 207k book covers. The book covers are classified into 32 main categories and over 3200 subcategories. In cases where a book cover was assigned to multiple categories, one category was randomly selected. Iwana et al.  [14] employed a subset of around 57k books from the original dataset for their experiments. This subset was equalized to contain 1900 samples per class. Two classes (Gay & Lesbian and Education & Teaching) were discarded as they comprised a limited number of samples. This dataset will be referred to as the 30cat dataset in this paper.

The book cover dataset is quite complex compared to standard image classification datasets like Caltech [38], MS-COCO [25], Oxford-102 [29], LSUN [41] and even ImageNet [33]. It is significantly difficult even for humans to classify. Most of the categories are distinguishable based on the textual description, while only some categories contain significant visual cues (recognizable objects) for their discrimination [16].

State-of-the-Art Models for Book Cover Classification

Automated classification of book covers is an interesting research question with practical implications in a variety of applications. Genre classification could reduce the time and costs invested in indexing books in large libraries or e-commerce platforms by utilizing only a single image of the book cover. Book cover image classification by means of machine learning methods has already been approached by some studies [4, 14], as has classification leveraging textual information like book titles or descriptions [20, 35]. These studies vary significantly in terms of the employed datasets and the computed metrics. Despite these efforts, none of the mentioned approaches to book cover image classification has employed state-of-the-art networks for this task. Iwana et al. [14] used AlexNet [21] and LeNet [22] architectures, whereas Buczkowski et al.  [4] employed a shallow CNN similar to the VGG architecture. However, no method has achieved significant improvements in terms of the computed metrics. Therefore, a natural question arises regarding the task itself: is book cover classification itself extremely difficult for the current generation of machine learning models, or were the models employed in those studies simply not adequate for the task? In order to answer this question, we employed some of the most recent state-of-the-art image classification models for the task of book cover classification.


Experiments

The models evaluated in this paper include NASNet [42], SE-ResNeXt-50 [11], SE-ResNet-50 [11], Inception ResNet v2 [36], DenseNet-161 [12], ResNet-152 [9], ResNet-50 [9] and VGG-16 [34], with NASNet achieving the best top-1 accuracy (82.7%) on the ImageNet [33] test set. We initialized our models using pretrained ImageNet [33] weights in order to benefit from transfer learning. As part of data augmentation, the input samples from the 30cat dataset are scaled up by a factor of 1.15, randomly cropped to the input size and then randomly flipped in the horizontal and vertical directions. To provide comparable conditions, all experiments were conducted with a fixed training time of 10 epochs, a batch size of 20 samples, a learning rate of 1e-4 and image sizes of \(224 \times 224\), \(299 \times 299\) or \(331 \times 331\) pixels depending on the model in question. Empirical trials showed that despite sophisticated hyperparameter tuning, all models tend to overfit within a few epochs, resulting in stagnant test set accuracy with an increasing number of epochs; this motivated the fixed training time of 10 epochs. It is possible to obtain marginal gains in accuracy by employing a more sophisticated hyperparameter tuning strategy; however, the accuracy is too low to begin with for any useful real-world application.
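For illustration, the preprocessing described above can be sketched as follows. This is a minimal NumPy sketch using nearest-neighbour resizing; it is not the exact implementation used in our experiments, where the interpolation and framework details differ.

```python
import numpy as np

def preprocess(image, out_size, train=True, rng=np.random):
    """Scale up by a factor of 1.15, take a random crop of out_size, and
    randomly flip horizontally/vertically, mirroring the augmentation
    described above. `image` is an (H, W, C) array."""
    h, w = image.shape[:2]
    nh, nw = int(round(h * 1.15)), int(round(w * 1.15))
    # nearest-neighbour upscale (a real pipeline would use bilinear/bicubic)
    rows = np.arange(nh) * h // nh
    cols = np.arange(nw) * w // nw
    scaled = image[rows][:, cols]
    if train:
        top = rng.randint(0, nh - out_size + 1)
        left = rng.randint(0, nw - out_size + 1)
        crop = scaled[top:top + out_size, left:left + out_size]
        if rng.rand() < 0.5:
            crop = crop[:, ::-1]   # horizontal flip
        if rng.rand() < 0.5:
            crop = crop[::-1, :]   # vertical flip
    else:
        # deterministic center crop at evaluation time
        top = (nh - out_size) // 2
        left = (nw - out_size) // 2
        crop = scaled[top:top + out_size, left:left + out_size]
    return crop
```

The same routine with `out_size` set to 224, 299 or 331 covers the three input resolutions listed above.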

Table 1 Accuracy comparison of state-of-the-art models to LeNet and AlexNet from [14] on the original 30cat dataset
Table 2 Accuracy comparison of IncResV2 and NASNet to the best performing image-based architecture ResNet50 ensemble from [20] on a subset of the 30cat dataset

The obtained results highlight only marginal gains even when employing some of the most sophisticated models to date, indicating that the task itself is hard for the current generation of deep learning models. Since all of the classes had the same number of examples in the test set, except for one, accuracy served as a good performance metric. The computed metrics (top-1 and top-3 accuracies) are presented in Table 1, with the results from Iwana et al. [14] included for comparison. We achieved an absolute 6% gain in top-1 and 10% gain in top-3 accuracy over the baseline established by Iwana et al. [14] by employing NASNet, which reached a top-1 accuracy of 30.5%, followed by ResNet-152 with 25.6%. Despite this improvement, the models trained for the task overfit to the training set, primarily because of the high intra-class variance, which made it extremely hard for the network to decipher the correct (textual) features using a purely end-to-end training strategy.

Table 2 shows the results reported by Kjartansson et al.  [20] for top-1 and top-3 per-class accuracies on their best performing image-based ResNet-50 ensemble. In comparison, the per-class accuracies of single Inception ResNet v2 and NASNet models, trained and tested on the same subset of ten classes are presented. It can be seen that both single state-of-the-art models slightly outperform the ensemble method used in Kjartansson et al.  [20], with Inception ResNet v2 yielding the highest combined top-1 accuracy of 59.6%. However, considering the per-class accuracies, every model still has its fortes, indicating that an ensemble can result in further gains.

Buczkowski et al.  [4] reported their results on an unpublished dataset with different categories and numbers of samples, precluding a direct comparison. Their dataset comprises 14 categories, where one category, named Others, was obtained by merging together all the categories comprising small numbers of examples. Therefore, a comparison of our results to those of  [4] is not included.

Discussion and Analysis

Fig. 2

Co-occurrence matrix, representing the number of mutual and exclusive occurrences of labels in the original dataset of 207k images and 32 categories

To better understand the nature of the classification problem, we analyzed the category distribution of the book cover dataset. Figure 2 shows the co-occurrence matrix representing the number of simultaneous occurrences of two main classes in the whole raw dataset containing 207k samples. One class is specifically prominent in this figure: the Reference class rarely occurs exclusively and is mixed with almost all other classes. This makes sense, as reference books are very common in scientific literature like the natural sciences, law and economics, while being very uncommon in literature like comics, thrillers and romances. Another very prominent mutual occurrence is that of Religion & Spirituality together with Christian Books & Bibles. This again is understandable, as Christian books are a subset of religious books. As the data have been collected from the American version of the source website, it is most likely that the subset of Christian books has been treated as a separate main class, as it specifically addresses the majority of American customers. Moreover, the set of main categories is given by the source platform’s category system and is not necessarily optimal for classification. By looking at the co-occurrence matrix in detail, other overlapping classes can also be observed. For example, History seems to overlap with many classes, particularly Arts & Photography as well as Religion & Spirituality. Another striking overlap is that of Literature & Fiction with Children’s Books. However, these specific mutual appearances are comprehensible. Moreover, it is in the nature of book genres to overlap, as books can cover broad content and genre assignment is very subjective.
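The construction of such a co-occurrence matrix can be sketched as follows. The convention that the diagonal counts exclusive occurrences follows the caption of Fig. 2; the function itself is illustrative.

```python
import numpy as np

def cooccurrence(label_sets, n_classes):
    """Build a class co-occurrence matrix: entry (i, j), i != j, counts
    books tagged with both class i and class j, while the diagonal entry
    (i, i) counts books tagged with class i alone (exclusive occurrences).
    `label_sets` is an iterable of per-book category-index sets."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for labels in label_sets:
        labels = sorted(set(labels))
        if len(labels) == 1:
            m[labels[0], labels[0]] += 1        # exclusive occurrence
        else:
            for i in labels:
                for j in labels:
                    if i != j:
                        m[i, j] += 1            # mutual occurrence
    return m
```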

Fig. 3

Plain book covers of books belonging to different categories

Fig. 4

Specifically designed series of book covers belonging to different categories

Another factor that adds complexity to this specific classification task is that the dataset exhibits low inter-class variance and high intra-class variance, which makes it extremely difficult for any image classification method. High intra-class variance pertains to the fact that there is a huge variety of different book covers present within a single category. Low inter-class variance, on the other hand, pertains to the fact that book covers belonging to different categories are strikingly similar. Figures 3 and 4 provide an insight into the low inter-class variance issue, where it can be seen that book covers containing very similar visual content belong to different classes. In many cases, a plain book cover (Fig. 3) or a specifically designed book cover (Fig. 4) occurred in 5–6 classes, where the only differentiating factor was the title, which justified the assignment of that particular category. This means that if the textual information is discarded, it is impossible even for humans to assign the corresponding book cover to a particular category.

Fig. 5

T-SNE plot of 28cat dataset using softmax activations, obtained from an Inception ResNet v2 classifier

In contrast to the inter-class and intra-class variances, which are an inherent problem of book cover classification, the findings from the category distribution analysis motivated us to cleanse the dataset. This should clarify the task definition and therefore reduce the confusion of the network during training, ultimately leading to better accuracies. A subset was extracted from the 30cat dataset in which the class Reference is removed and the class Christian Books & Bibles is merged with the class Religion & Spirituality, resulting in 28 classes and 55.1k samples. This dataset will be referred to as the 28cat dataset in this paper.
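The cleansing step can be sketched as follows; the record format and function name are illustrative, not part of the released dataset tooling.

```python
# Classes removed from / merged in the 30cat dataset, as described above.
REMOVE = {"Reference"}
MERGE = {"Christian Books & Bibles": "Religion & Spirituality"}

def cleanse(records):
    """Derive the 28cat subset from 30cat records: drop books labelled
    Reference and fold Christian Books & Bibles into Religion &
    Spirituality. `records` is a list of (cover_id, category) pairs."""
    cleaned = []
    for cover_id, category in records:
        if category in REMOVE:
            continue
        cleaned.append((cover_id, MERGE.get(category, category)))
    return cleaned
```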

Figure 5 visualizes the embeddings of a pretrained Inception ResNet v2 on the cleansed 28cat book cover dataset. Despite employing a state-of-the-art image classification model, the embedding space is still highly overlapping, highlighting the complexity of the problem. There are also categories that seem to be well segregated, such as the Test Preparation class, since its covers are highly distinctive.

Cleansed Dataset Evaluation

Table 3 Accuracies of Inception ResNet v2 architecture on original 30cat dataset and on 28cat subset

Based on the insights from “Discussion and Analysis” section, the impact of cleansing the dataset is quantified in the following. Two separate experiments were conducted to simplify the classification problem posed by the 30cat dataset and to reduce the resulting confusion of the models. All further experiments use the best performing Inception ResNet v2 architecture, pretrained on the ImageNet dataset, with the hyperparameters highlighted in “Experiments” section. For training, image data are scaled up by a factor of 1.15, randomly cropped to the input size and randomly flipped in the horizontal and vertical directions. One model is trained on the original 30cat dataset, the other on the 28cat subset. Table 3 shows that by removing the Reference class and merging the two classes related to religion, an increase of 1.1% compared to the initial 30 classes was observed. This increase, although a minor one, reinforces the assumption that the occurrence of these subclasses caused confusion in the classification task. It should be mentioned that the accuracy partly increases naturally due to the simplification of the classification problem, as the number of classes is reduced. However, the choices of removing and merging classes are justifiable, as shown in Fig. 2, and improve the problem definition. Merging and redefining more subclasses could further improve the quality of the classification problem’s definition. This would enhance the utility of the resulting classifiers and might further increase classification accuracies. However, finding proper super-classes is a very subjective and complex task, requiring in-depth domain knowledge. Due to the positive effect of confusion reduction, all subsequent experiments are conducted on the cleansed 28cat subset.

Extensive Model Tweaking

We now extend the initial experiments with the Inception ResNet v2 architecture in several ways in order to better understand the complexity of the problem and unveil key obstacles in the task of book cover classification. We first analyzed the effect of exhaustive data augmentation on the resulting classifier. We also benchmarked several attention mechanisms allowing the network to explicitly focus on parts of the book cover that were actually influential for a particular prediction. To this end, we also analyzed the impact of incorporating spatial transformer networks (STNs) [15], where the network can learn a full affine transformation of the input through backpropagation. We then assessed the impact of fusing textual information with the visual cues in order to identify the gains from the two different information streams. Finally, we ensembled all the different models trained for the different experiments to highlight the possible gains through an ensembling scheme. All experiments are conducted on the 28cat dataset with the hyperparameters mentioned in “Experiments” section.

Data Augmentation

Data augmentation is a common technique used to artificially increase dataset size and ultimately avoid overfitting. Especially in image classification, plenty of different augmentation techniques have been proposed in the past. The experiment was conducted by training the Inception ResNet v2 model on augmented input data. The 55.1k samples from the 28cat subset were augmented by random flips on the horizontal and vertical axes, random changes of contrast, hue and saturation, and random blurring, translation and rotation of the book cover images. The model, pretrained on ImageNet, is further fine-tuned for 10 epochs. The hyperparameters were kept constant, as specified in “Experiments” section.
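A minimal NumPy sketch of part of this augmentation pipeline (flips, contrast jitter and translation) is given below; hue/saturation shifts, blur and rotation are analogous. The jitter and shift ranges shown are illustrative, not the exact values used in our experiments.

```python
import numpy as np

def augment(img, rng=np.random):
    """Apply random flips, contrast jitter and a random translation to a
    float (H, W, C) image with values in [0, 1]."""
    if rng.rand() < 0.5:
        img = img[:, ::-1]                       # horizontal flip
    if rng.rand() < 0.5:
        img = img[::-1, :]                       # vertical flip
    # contrast jitter: stretch/compress around the image mean
    factor = rng.uniform(0.8, 1.2)
    img = np.clip((img - img.mean()) * factor + img.mean(), 0.0, 1.0)
    # random translation by up to ~10% of each side, zero-padded
    h, w = img.shape[:2]
    dy = rng.randint(-h // 10, h // 10 + 1)
    dx = rng.randint(-w // 10, w // 10 + 1)
    out = np.zeros_like(img)
    ys, xs = max(dy, 0), max(dx, 0)
    ye, xe = h + min(dy, 0), w + min(dx, 0)
    out[ys:ye, xs:xe] = img[ys - dy:ye - dy, xs - dx:xe - dx]
    return out
```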

The computed metrics for the augmented network are reported in Table 4. As a reference, the accuracy of the baseline Inception ResNet v2 model on the 28 category dataset is given in the first row. Unfortunately, the test accuracy decreased in comparison with the previous experiments. Since we fine-tune the network for a fixed number of epochs (10 in all our experiments), the additional variation introduced by the augmentation can increase the required training time, which might explain the drop in performance. However, there is also the possibility that the chosen augmentations themselves hampered the performance of the original network. More sophisticated strategies like AutoAugment [7], which automatically learns augmentation policies from the data and achieved state-of-the-art performance on the ImageNet [33] dataset, could also be introduced.

Table 4 Inception ResNet v2 results on 28cat dataset: with & without augmentation

Attention Module

The use of attention mechanisms in book cover classification was already recommended by Kjartansson et al.  [20]. Jolly et al.  [16] observed that CNNs seem to rely heavily on objects in the book covers for classification. In addition, they found that smaller textual content, which is often crucial for the classification of book covers by humans, is of less relevance to the networks. Additionally, many book covers consist mostly of planar regions that do not contribute to the classification; focusing on these regions could potentially result in severe overfitting. Therefore, to further investigate these findings, we experimented with several variations of attention mechanisms on the basic structure of Inception ResNet v2 to assess their suitability for the task of book cover classification. A basic schematic of the different modules is presented in Fig. 6. The same training parameters as in the previous sections were used. We now briefly explain the different methodologies employed to incorporate explicit attention into the network. The obtained results are presented in Table 5.

Fig. 6

Inception ResNet v2 additionally equipped with different attention strategies

Table 5 Inception ResNet v2 results on 28cat dataset: with and without Attention

Simple Attention Initially, we implemented a simple attention mechanism as proposed by Rodriguez et al.  [32]. A single \(1\times 1\) filter was applied to the model’s last convolutional feature map of size \(8\times 8\). The output is then normalized using the softmax activation function, which serves as the attention over spatial locations. The resulting tensor is element-wise multiplied with the initial feature map to exert attention. This implementation with softmax activation yielded only 17.1% accuracy. By inspecting the resulting attention masks, we found that the network drew all of its attention onto one specific spot of the feature map, which led to intense overfitting to the training data.

In order to enforce the diffusion of the attention mask, we augmented the attention mechanism by employing sigmoid and temperature-augmented softmax functions. The attention masks indeed showed a diffusion of attention onto specific areas that mostly contained objects, persons or big lettering. The temperature-augmented softmax resulted in a slight improvement over the sigmoid, achieving an accuracy of 31.0%. We visualize the attention maps computed from the temperature-augmented softmax in Fig. 7. Figure 7g shows the scale, indicating the respective attention of the network. Figure 7a highlights examples from the Cookbooks, Food & Wine class, where the network correctly focused on the food items in order to come up with the correct prediction. The network focused only on the big lettering in order to identify the Test Preparation category, as highlighted in Fig. 7b. For the Comics & Graphic Novels category, the network interestingly learned to focus on the faces of the characters (Fig. 7c). For the Engineering & Transportation category, the network learned to attend to cars and bikes, which were very common (Fig. 7d). Italic and stylish fonts were quite common in the Romance category; therefore, the network learned to attend to these stylish fonts, along with faces, in order to tell the class apart (Fig. 7e), which is consistent with the findings of previous work [16]. Finally, since the Law category mainly comprised textual content on the cover, the network learned to keep the text in focus, as highlighted in Fig. 7f.
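The spatial attention variants discussed above can be sketched as follows, with the \(1\times 1\) convolution reduced to a single weight vector over channels for brevity; the temperature value shown is illustrative. A higher temperature flattens the softmax and thereby diffuses the attention mass over more locations.

```python
import numpy as np

def spatial_attention(fmap, w, activation="softmax", temperature=4.0):
    """Compute a spatial attention mask over an (H, W, C) feature map
    (8x8 for the last Inception ResNet v2 block) and gate the map with it.
    `w` is a (C,) weight vector standing in for the 1x1 filter."""
    scores = fmap @ w                              # (H, W) attention logits
    if activation == "sigmoid":
        mask = 1.0 / (1.0 + np.exp(-scores))       # independent per-location gate
    else:
        t = temperature if activation == "temperature" else 1.0
        e = np.exp((scores - scores.max()) / t)
        mask = e / e.sum()                         # softmax over all locations
    return fmap * mask[..., None], mask
```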

Fig. 7

Examples of attention maps from an Inception ResNet v2 model using an augmentation mechanism based on the temperature-augmented softmax function

Saliency-Based Attention As a follow-up, we implemented an attention mechanism based on saliency maps of the network’s input. This mechanism is meant to focus the network’s attention on salient regions containing text or objects, so that areas containing irrelevant details do not lead to confusion. The input image is used to calculate an attention mask using Hou and Zhang’s method of spectral residual saliency detection [10]. This attention mask is then converted to a binary map using a manual threshold value of 10 (for values in the range [0, 255]) and finally resized to \(8\times 8\). The resulting mask is again element-wise multiplied with the last convolutional feature map. Some examples of the upscaled attention masks are visualized in Fig. 8a and b.

Fig. 8

Examples of attention maps from an Inception ResNet v2 model using an augmentation mechanism based on saliency maps

The saliency-based attention mechanism focused strongly on objects as well as on textual content of all types. However, the resulting accuracy of 30.8% lies slightly below the previously mentioned approaches. Figure 8a shows some good examples on the first row and bad examples on the second. The good examples highlight a very precise focus on objects, symbols and important textual regions. However, sometimes important contexts like landscapes or big objects have been neglected due to their modest appearance, e.g., smooth color gradients.
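The saliency computation and pooling can be sketched as follows. This is a simplified NumPy rendition of the spectral residual method; the smoothing that Hou and Zhang apply to the final map is omitted for brevity, and the mean-pooled resize stands in for whichever interpolation the actual pipeline used.

```python
import numpy as np

def spectral_residual_mask(gray, grid=8, threshold=10):
    """Spectral-residual saliency followed by binarisation and pooling to
    a grid x grid mask. `gray` is a square float grayscale image whose
    side length is a multiple of `grid`."""
    f = np.fft.fft2(gray)
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    # spectral residual: log amplitude minus its 3x3 local average
    p = np.pad(log_amp, 1, mode="edge")
    h, w = log_amp.shape
    avg = sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    residual = log_amp - avg
    # reconstruct with the residual amplitude and the original phase
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = 255.0 * (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
    binary = (sal > threshold).astype(float)
    # mean-pool the binary map down to the feature-map resolution
    bh, bw = h // grid, w // grid
    return binary.reshape(grid, bh, grid, bw).mean(axis=(1, 3))
```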

Residual Attention The previous architecture was then modified into a combined approach. The saliency map, computed from the input image, is element-wise multiplied with the last convolutional feature map of shape \(8\times 8\). A trainable attention branch with a \(1\times 1\) convolution followed by a Tanh activation function is also element-wise multiplied with the last feature map. These two tensors are then summed and passed to the output block of Inception ResNet v2. By applying the Tanh activation, the attention mechanism is able to dynamically adapt the attention given by the saliency input map. In a second experiment (Residual Stacked), an extension of this architecture was tested, where the trainable attention’s \(1\times 1\) convolution is preceded by a convolutional layer with 32 filters of kernel size \(3 \times 3\). This experiment was conducted with the aim that the trainable attention mechanism also takes context into account when deciding about the salient regions, as opposed to the case where only a \(1\times 1\) convolution is applied pixel-wise. The accuracies achieved by the two approaches vary significantly.
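The residual combination can be sketched as follows; as before, the \(1\times 1\) convolution is reduced to a weight vector over channels for brevity.

```python
import numpy as np

def residual_attention(fmap, saliency_mask, w):
    """Gate the (H, W, C) feature map once by the fixed saliency mask and
    once by a trainable Tanh-activated 1x1-conv branch (represented here
    by the (C,) weight vector `w`), and sum the two gated tensors."""
    gate = np.tanh(fmap @ w)                       # (H, W) trainable attention
    return fmap * saliency_mask[..., None] + fmap * gate[..., None]
```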

Spatial Transformer Networks Finally, we implemented two spatial transformer networks (STNs) [15] to incorporate hard attention. The first is a conventional STN, using a separate localization network made up of three blocks of max pooling, convolution and batch normalization layers. An intermediate feature map is again max-pooled and concatenated to the final feature map, flattened and fed into a dense layer of 512 units and an output layer of six units for the affine transform parameters. The affine transform is then applied to the input image, which is then fed into the classification network. The second approach uses an intermediate feature map of the classification network to produce the six affine transform parameters. A feature map of shape \(8 \times 8 \times 2080\) is flattened and fed into a dense layer with six units. The output is used to transform the input image, which is fed back into the Inception ResNet v2 architecture.
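The transform step of an STN can be sketched as follows, using nearest-neighbour sampling in place of the bilinear kernel of Jaderberg et al. for brevity; `theta` stands for the six affine parameters predicted by the localization branch.

```python
import numpy as np

def affine_warp(img, theta):
    """Apply a 2x3 affine matrix `theta` to `img` in normalised [-1, 1]
    coordinates, as in spatial transformer networks, with
    nearest-neighbour sampling."""
    h, w = img.shape[:2]
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1)    # (H, W, 3)
    src = grid @ theta.T                                    # sampling coords
    # map normalised coordinates back to pixel indices and clamp
    sx = np.clip(((src[..., 0] + 1) * 0.5 * (w - 1)).round().astype(int), 0, w - 1)
    sy = np.clip(((src[..., 1] + 1) * 0.5 * (h - 1)).round().astype(int), 0, h - 1)
    return img[sy, sx]

# the identity transform leaves the image unchanged
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
```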

Fig. 9

Examples of input images transformed by the transformer layer of a modified Inception ResNet v2 model

The STN transformations turned out to be very unstable. Initially, the transformation layer extensively zoomed into, zoomed out of or rotated the images. Most experiments led to transform parameters that would let the input image disappear completely, resulting in accuracies almost at the level of random guessing. Examples of the resulting transformations of the input images are presented in Fig. 9. The results summarized in Table 5 indicate that the non-STN variants achieved higher accuracies. The common STN approach with a separate localization network resulted in a zoomed-out view of the book covers, where the actual book cover covered only a small proportion of the input. Jaderberg et al.  [15] demonstrated their method on MNIST [23], SVHN [28] and CUB-200-2011 [38].

An STN transforms the image into its canonical pose by optimizing the affine transform parameters so as to minimize the overall objective. For this to correct the warping of an image, it is implicitly assumed that the same set of transform parameters generalizes to the whole class, i.e., that the canonical pose is shared within a class. In the case of book covers, the intra-class variance is very high, preventing the STN module from extracting any generalizable transformation parameters from the dataset, as the relevant features vary widely from image to image.

Although the incorporation of attention in Inception ResNet v2 resulted in modest gains in accuracy (27.8% vs. 31.1%), the main aim of these attention mechanisms was to gain a better understanding of what the network learned during its training phase. The obtained results indicate that the network learned to focus on the correct regions in some cases, but in most cases there were no visual cues that the network could consistently exploit for classification. This makes the book cover classification task distinct from other classification problems such as the ImageNet large-scale visual recognition challenge [33].

Table 6 Inception ResNet v2 results on 28cat dataset: with different loss metrics

Loss Metric

According to recent findings in [27], the cross-entropy loss can be problematic in some cases. For this experiment, the Inception ResNet v2 model was trained using the mean-squared error (MSE) instead of the cross-entropy loss. The obtained results are presented in Table 6. The model yielded an accuracy of 30.1% after 10 epochs, an increase of 2.3% over the experiment with cross-entropy loss. However, the increase is not significant, and the loss metric does not appear to be the origin of the problem.
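The two loss metrics differ only in the objective applied to the softmax output; a toy comparison on a single prediction (the numbers are illustrative, not taken from the experiments):

```python
import numpy as np

# One softmax output and its one-hot ground truth.
probs = np.array([0.7, 0.2, 0.1])    # predicted class distribution
target = np.array([1.0, 0.0, 0.0])   # correct class is the first one

# Cross-entropy penalizes only the probability of the true class;
# MSE penalizes the squared deviation of every class probability.
cross_entropy = -np.sum(target * np.log(probs))   # ≈ 0.357
mse = np.mean((probs - target) ** 2)              # ≈ 0.047
```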

Table 7 Inception ResNet v2 results on 28cat dataset: with and without GAN augmentation
Fig. 10

Generated book cover samples from trained GAN

GAN Pretraining

In this experiment, GAN-generated images were used for pretraining the basic Inception ResNet v2 model that had previously been pretrained on ImageNet [33]. For generating the book cover samples, we modified the base architecture of the Progressive GAN [19] framework. In total, 551k generated samples from the 28 categories were used. After pretraining on the generated samples, the model was further fine-tuned on the real images from the 28cat subset for 10 epochs. Figure 10 shows samples generated by the trained GAN, where Fig. 10a, b and c show generated book covers for the Children's Books, Mystery and Medical Books genres, respectively. Two experiments were conducted, with pretraining stopped after seven and ten epochs, in order to examine the influence of pretraining.

Table 7 indicates that with the models pretrained on GAN images, the accuracies decreased by 1.3% and 2.2% after pretraining for seven and ten epochs, respectively. Longer pretraining on GAN images thus worsens the result, as pretraining for three more epochs resulted in a further decrease of about 1% in accuracy. One plausible explanation for this drop in accuracy is the poor quality of the conditioning of the generated samples.

Incorporation of Textual Modality

As the textual content on a book cover is usually extremely important for humans when classifying a book, we evaluated the relevance of the book titles for the classification problem, under the assumption that the text present on a book cover is much more descriptive for classification than the visual cues. We used the title information available in the dataset to implement a text-based classifier in order to examine the potential of incorporating text. The text-based classifier is implemented using sentence vectors from FastText [17]. In addition, different ensembles of text and image classifiers were evaluated; the text embeddings for these ensembles were also obtained using FastText. For the image-based classifier, the basic Inception ResNet v2 architecture pretrained on ImageNet was used. Three variants of the network were tested. We first evaluated an early-fusion scheme, where the sentence embeddings are broadcast over the height and width dimensions and concatenated to the channel axis of the CNN's input. Next, we tested a late-fusion scheme, where the sentence embeddings are concatenated to the flattened tensor from the last convolutional layer and fed into an additional dense layer with 4096 units and a ReLU activation function. Finally, a third variant combined both early and late fusion in one network (dual fusion). A scheme of the different variants is presented in Fig. 11. Furthermore, we used the embeddings from the pretrained text and image ensemble classifiers to train a support vector machine (SVM) classifier; for this experiment, the previously mentioned late-fusion ensemble was used.
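The tensor manipulations behind early and late fusion can be sketched with NumPy; the shapes below (including the FastText embedding size `D`) are illustrative assumptions, not the exact dimensions used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: small spatial map, few channels, assumed
# sentence-vector size D.
H, W, C, D = 8, 8, 16, 100
image = rng.standard_normal((H, W, 3))          # RGB input
sentence = rng.standard_normal(D)               # FastText sentence vector
conv_features = rng.standard_normal((H, W, C))  # last conv feature map

# Early fusion: broadcast the sentence vector over height and width,
# then concatenate it to the input's channel axis.
broadcast = np.broadcast_to(sentence, (H, W, D))
early = np.concatenate([image, broadcast], axis=-1)
assert early.shape == (H, W, 3 + D)

# Late fusion: concatenate the sentence vector to the flattened last
# convolutional feature map before the extra dense layer.
late = np.concatenate([conv_features.ravel(), sentence])
assert late.shape == (H * W * C + D,)
```

Dual fusion simply applies both operations in the same network: the input channels are augmented early, and the sentence vector is concatenated again before the classifier head.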

Fig. 11

Architectural diagram for experiments conducted on text–image ensembles

It is evident from Table 8 that the classifier trained on just the title of the book (embedded using FastText) is significantly superior in terms of accuracy (55.6%) to the image-based classifier (27.8%). This is consistent with our understanding of the problem. The table also shows that fusing the textual and visual information deteriorates performance in almost all cases instead of improving it; in the case of late fusion, the improvement is statistically insignificant.

Table 8 Inception ResNet v2 results on 28cat dataset: Multimodal (Text & Image)

Model Ensembles

An ensemble is a set of multiple, mutually complementary classifiers whose predictions are combined in order to benefit from their differing error distributions. By combining several classifiers of lower complexity, a more complex decision function can be obtained, suitable for difficult problems. Ensembling is commonly employed to further boost the classification performance of the individual classifiers. Given the high number of models trained in the previous sections, applying ensembling to examine the potential gains is a natural next step.

Many different approaches to ensemble learning are available. In the following, a simple voting scheme is used to combine the predictions of the different models: among all models' predictions, the label with the most votes is chosen, and if two or more labels receive an equal number of votes, one of them is chosen at random. Different combinations have been evaluated. The first combination (Ensemble 1) includes the strongest models trained on 28cat so far: Inception ResNet v2 with MSE loss, attention with temperature-augmented softmax, saliency-based attention and residual attention. All of these models yielded test accuracies of more than 30%. This ensemble achieved an accuracy of 33.9%, surpassing the best single model by 2.8%. Ensemble 2 additionally included the models fine-tuned on the GAN-generated images for seven and ten epochs, both of which yield significantly lower accuracies of 26.5% and 25.6%, respectively. Table 9 shows that even adding those two weaker models increased the test accuracy by a further 1.1%. Given this performance boost, another combination (Ensemble 3) was evaluated, consisting of nine different models: in addition to the models from Ensemble 2, the augmented model and two differently initialized models based on Inception ResNet v2 were included, all of which yielded test accuracies below 27%. The resulting test accuracy of 36.6% outperformed the best single model by 5.5%. Further inclusion of models resulted in only negligible gains in classification accuracy and is hence omitted for clarity.
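The voting scheme itself is a few lines of standard-library Python; the genre labels in the example are placeholders:

```python
import random
from collections import Counter

def vote(predictions, rng=random.Random(0)):
    """Majority vote over the per-model predicted labels for one sample.
    Ties between equally frequent labels are broken uniformly at random,
    as described in the text."""
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return rng.choice(winners)

# Four models vote on one test sample: "Mystery" wins 2-1-1.
assert vote(["Mystery", "Mystery", "Medical", "Children"]) == "Mystery"
```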

Table 9 Inception ResNet v2 results on 28cat dataset: Ensemble

The obtained results indicate that ensembling increases the obtained accuracies for the book cover genre classification task despite the low independent accuracies of some models. This is plausible, as ensembling benefits from high variance in model initialization and architecture, and thus from combinations of various local minima, resulting in slightly different strengths in mapping the given input data distribution to the target distribution. By systematically choosing models with complementary strengths, performance could be improved further. This is consistent with the findings of Kjartansson et al. [20]. However, the book cover classification problem requires solving an interim task, i.e., OCR, and provides only poor textural cues, both of which are significant impediments for the current generation of deep models.


Book cover classification is an intriguing research question that also has practical value. We therefore evaluated the efficacy of state-of-the-art deep learning models for the classification of book covers. Despite the range of experiments performed, the obtained results for image-based classification significantly underperformed the simple text-based classifier. A plausible explanation for this poor performance is a violation of the i.i.d. (independent and identically distributed) assumption: although the samples are independent, they are not exactly identically distributed, since book covers are limited only by the imagination of the artists, unlike natural images, which are usually identically distributed.

With the obtained results and analysis, it is evident that the current generation of state-of-the-art models is unable to solve this task to a satisfactory level of performance. Therefore, significant efforts need to be invested in order to solve it. These advances will cover the development of more sophisticated deep learning models as well as specific strategies to improve their applicability to this problem. A particularly important direction in this regard is the development of advanced feature extraction techniques that can learn the correct set of invariants for the task, which is in itself a very hard problem to solve.


  1. Afzal MZ, Capobianco S, Malik MI, Marinai S, Breuel TM, Dengel A, Liwicki M. Deepdocclassifier: document classification with deep convolutional neural network. In: 2015 13th international conference on document analysis and recognition (ICDAR); 2015. p. 1111–5.

  2. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L. Bottom-up and top-down attention for image captioning and visual question answering. In: The IEEE conference on computer vision and pattern recognition (CVPR); 2018. vol. 3, p. 6.

  3. Bach S, Binder A, Montavon G, Klauschen F, Müller KR, Samek W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE. 2015;10(7):e0130140.

  4. Buczkowski P, Sobkowicz A, Kozlowski M. Deep learning approaches towards book covers classification. In: International conference on pattern recognition applications and methods (ICPRAM); 2018. p. 309–16.

  5. Chen LC, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation; 2017. arXiv:1706.05587.

  6. Chiu CC, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2018. p. 4774–8.

  7. Cubuk ED, Zoph B, Mané D, Vasudevan V, Le QV. AutoAugment: learning augmentation policies from data; 2018. arXiv:1805.09501.

  8. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems (NIPS); 2014. p. 2672–80.

  9. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: The IEEE conference on computer vision and pattern recognition (CVPR); 2016. p. 770–8.

  10. Hou X, Zhang L. Saliency detection: a spectral residual approach. In: The IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2007. p. 1–8.

  11. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132–41.

  12. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4700–8.

  13. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. SqueezeNet: AlexNet-level accuracy with 50\(\times\) fewer parameters and <0.5 MB model size; 2016. arXiv:1602.07360.

  14. Iwana BK, Rizvi STR, Ahmed S, Dengel A, Uchida S. Judging a book by its cover; 2016. arXiv:1610.09204.

  15. Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in neural information processing systems (NIPS); 2015. p. 2017–25.

  16. Jolly S, Iwana BK, Kuroki R, Uchida S. How do convolutional neural networks learn design? In: 2018 24th international conference on pattern recognition (ICPR). IEEE; 2018. p. 1085–90.

  17. Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics: volume 2, short papers. Association for Computational Linguistics; 2017. p. 427–31.

  18. Karayev S, Trentacoste M, Han H, Agarwala A, Darrell T, Hertzmann A, Winnemoeller H. Recognizing image style; 2013. arXiv:1311.3715.

  19. Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of GANs for improved quality, stability, and variation; 2017. arXiv:1710.10196.

  20. Kjartansson S, Ashavsky A. Can you judge a book by its cover? Stanford CS231N; 2017.

  21. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems (NIPS); 2012. p. 1097–105.

  22. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

  23. LeCun Y, Cortes C. MNIST handwritten digit database; 2010.

  24. Libeks J, Turnbull D. You can judge an artist by an album cover: using images for music annotation. IEEE MultiMedia. 2011;18(4):30–7.

  25. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL. Microsoft COCO: common objects in context. In: European conference on computer vision. Springer; 2014. p. 740–55.

  26. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. Playing Atari with deep reinforcement learning; 2013. arXiv:1312.5602.

  27. Nar K, Ocal O, Sastry SS, Ramchandran K. Cross-entropy loss leads to poor margins. OpenReview; 2019.

  28. Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY. Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning; 2011. vol. 2011, p. 5.

  29. Nilsback ME, Zisserman A. Automated flower classification over a large number of classes. In: Sixth Indian conference on computer vision, graphics & image processing (ICVGIP'08). IEEE; 2008. p. 722–9.

  30. Oramas S, Barbieri F, Nieto O, Serra X. Multimodal deep learning for music genre classification. Trans Int Soc Music Inf Retr. 2018;1(1):4–21.

  31. Oramas S, Nieto O, Barbieri F, Serra X. Multi-label music genre classification from audio, text, and images using deep features; 2017. arXiv:1707.04916.

  32. Rodríguez P, Cucurull G, Gonzàlez J, Gonfaus JM, Roca X. A painless attention mechanism for convolutional neural networks; 2018.

  33. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L. ImageNet large scale visual recognition challenge. Int J Comput Vis (IJCV). 2015;115(3):211–52.

  34. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition; 2014. arXiv:1409.1556.

  35. Sobkowicz A, Kozłowski M, Buczkowski P. Reading book by the cover-book genre detection using short descriptions. In: International conference on man–machine interactions. Springer; 2017. p. 439–48.

  36. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI conference on artificial intelligence; 2017. vol. 4, p. 12.

  37. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: The IEEE conference on computer vision and pattern recognition (CVPR); 2015.

  38. Wah C, Branson S, Welinder P, Perona P, Belongie S. The Caltech-UCSD Birds-200-2011 dataset; 2011.

  39. Wang Y, Skerry-Ryan R, Stanton D, Wu Y, Weiss RJ, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S, et al. Tacotron: a fully end-to-end text-to-speech synthesis model; 2017. arXiv:1703.10135.

  40. Yao JG, Wan X, Xiao J. Recent advances in document summarization. Knowl Inf Syst. 2017;53(2):297–336.

  41. Yu F, Seff A, Zhang Y, Song S, Funkhouser T, Xiao J. LSUN: construction of a large-scale image dataset using deep learning with humans in the loop; 2015. arXiv:1506.03365.

  42. Zoph B, Vasudevan V, Shlens J, Le QV. Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 8697–710.

  43. Zujovic J, Gandy L, Friedman S, Pardo B, Pappas TN. Classifying paintings by artistic genre: an analysis of features & classifiers. In: IEEE international workshop on multimedia signal processing (MMSP'09). IEEE; 2009. p. 1–5.



This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and partially supported by JSPS KAKENHI (Grant JP17H06100). We thank all members of the Deep Learning Competence Center at the DFKI for their comments and support.

Author information



Corresponding author

Correspondence to Adriano Lucieri.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Document Analysis and Recognition” guest edited by Michael Blumenstein, Seiichi Uchida and Cheng-Lin Liu.

The source code and the models are available at .


About this article


Cite this article

Lucieri, A., Sabir, H., Siddiqui, S.A. et al. Benchmarking Deep Learning Models for Classification of Book Covers. SN COMPUT. SCI. 1, 139 (2020).



  • CNN
  • Book cover
  • Book cover classification