Computational Visual Media

In this paper, we introduce an image dataset for fine-grained classification of dog breeds: the Tsinghua Dogs Dataset. It is currently the largest dataset for fine-grained classification of dogs, including 130 dog breeds and 70,428 real-world images. It has only one dog in each image and provides annotated bounding boxes for the whole body and head. In comparison to previous similar datasets, it contains more breeds and more carefully chosen images for each breed. The diversity within each breed is greater, with between 200 and 7000+ images for each breed. Annotation of the whole body and head makes the dataset suitable not only for improving fine-grained image classification models based on overall features, but also for those that locate informative local parts. We show that the dataset provides a tough challenge by benchmarking several state-of-the-art deep neural models. The dataset is available for academic purposes at https://cg.cs.tsinghua.edu.cn/ThuDogs/.


Introduction
Dogs are closely involved in human lives as family members, and are very common as pets. On the other hand, the number of dog-related incidents of injury and uncivilized behavior is increasing. This leads to a need for dog identification using modern visual technology, both for dog recognition and for finer-grained classification by breed.
Fine-grained classification is a non-trivial problem: subclasses must be distinguished by subtle inter-class differences. As for other visual tasks, the performance of fine-grained classification has been greatly boosted by the use of deep neural networks [1][2][3][4]. However, there are relatively small differences between dogs of different breeds, while there can be relatively large differences between dogs within a breed due to geographic isolation or hybridization. See, for example, Fig. 1: Great Danes have multiple colors, while dogs of different breeds, such as Norwich terriers and Australian terriers, may have similar colors. Existing datasets, such as the widely used Stanford Dogs Dataset [5], are not diverse enough to cover such variations, limiting their use for training and testing algorithms.
This paper contributes a new dataset, Tsinghua Dogs, with an emphasis on fine-grained dog classification. It contains 130 breeds of dogs in 70,428 images, with one dog per image, over 65% of which were collected from everyday life. It covers nearly all dog breeds currently found in China. Each breed in our dataset contains at least 200 images, up to a maximum of 7449 images, roughly in proportion to the breed's frequency of occurrence in China, so it significantly increases the diversity for each breed over existing datasets. Furthermore, we have annotated bounding boxes of the dog's whole body and head in each image, which can be used for supervising the training of learning algorithms as well as testing them.
We have also benchmarked several classification methods on our dataset, including both general neural networks and fine-grained models which exhibit good performance on other fine-grained datasets. The results show that the large diversity of our dataset poses a tougher challenge, so it should be beneficial in the development and testing of algorithms for real-world applications.

Fine-grained classification
Fine-grained classification technology is an obvious next step from traditional coarse classification technology [6][7][8][9]. Coarse classification is generally intended to distinguish different types of objects such as animals and vehicles, while fine-grained classification usually needs to differentiate subclasses within a class, such as breeds of animals or makes or models of vehicles. CUB200-2011 [10] is a well-known fine-grained classification dataset containing 200 different bird species; see Fig. 2.
Fig. 2 Birds in the CUB200-2011 dataset [10].
The main difficulties for fine-grained classification are typically the large number of fine-grained categories, and high intra-class but low inter-class variance. Currently, the best fine-grained classification methods use machine learning techniques, in particular deep neural networks. Research into fine-grained classification considers at least the following issues.

Locating informative parts
In order to distinguish different subclasses, an intuitive approach is to explicitly take advantage of differences between corresponding object parts. Handcrafted features [11,12] are extracted from object parts and fed to linear classifiers, such as SVMs. Deep learning methods provide better performance, with the parts located and normalized for pose [13][14][15].
Since only a few key parts are useful for fine-grained classification, Lam et al. [16] proposed to search only for informative parts in the deep feature map. Chen et al. [17] first decomposed the input image into local parts and found the discriminative regions by reconstructing the image. Ge et al. [18] explored complementary object parts in addition to the dominant one. Du et al. [19] fused parts at various granularities for better performance. Recently, the idea of identifying the most informative parts to provide more robust performance has been realized by exploiting spatial attention mechanisms, such as multi-attention [20], recurrent attention [21], trilinear attention [22], and multi-scale object and part attention [23]. Sun et al. [24] introduced diversification blocks in feature maps to find the most discriminative differences between easily confused classes.
Models based on bilinear pooling can also implicitly learn the informative local parts. Lin et al. [25] explored pairwise relations of local parts using a bilinear pooling of outer products of features from two convolutional extractors. Gao et al. [26] proposed a more compact bilinear pooling. Yu et al. [27] exploited hierarchical bilinear pooling to account for interaction of features between layers.
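To make the bilinear pooling operation concrete, the following is a minimal pure-Python sketch, not the implementation of [25]. It assumes each extractor yields one feature vector per spatial location; the function name `bilinear_pool` and the list-of-lists layout are illustrative choices, and the signed square root and L2 normalization are the post-processing steps commonly paired with bilinear features.

```python
import math


def bilinear_pool(feats_a, feats_b):
    """Sum the outer products of two feature sets over all spatial locations.

    feats_a: list of L vectors of length M (extractor A, one vector per location)
    feats_b: list of L vectors of length N (extractor B, same locations)
    Returns a flattened descriptor of length M * N.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both extractors must cover the same locations")
    m, n = len(feats_a[0]), len(feats_b[0])
    pooled = [0.0] * (m * n)
    for fa, fb in zip(feats_a, feats_b):
        for i in range(m):
            for j in range(n):
                # Outer product entry (i, j), accumulated over locations.
                pooled[i * n + j] += fa[i] * fb[j]
    # Signed square root followed by L2 normalization (common post-processing).
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled
```

The descriptor length grows as M × N, which is the cost that more compact variants such as [26] aim to reduce.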

Learning from image pairs
Learning discriminative cues directly from an image pair is more intuitive, since human beings can easily tell fine-grained classes apart by comparing image pairs. Metric learning, a typical solution for measuring the similarity between image pairs, has also been used for fine-grained classification, e.g., using a triplet loss design [28,29], maximum entropy [30], a multi-stage method [31], multi-attention multi-class constraints [32], and pairwise confusion regularization [33]. These methods are mainly designed to separate images in feature space, but are less capable of discriminating subtle differences between confusing images. Recently, Zhuang et al. [34] suggested finding contrasting cues directly from a pair of images via attentive pairwise interaction. This method achieves state-of-the-art performance on several fine-grained classification datasets.
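As an illustration of the metric-learning idea, a generic triplet loss can be sketched as follows. This is a textbook form rather than the specific designs of [28,29]; the squared Euclidean distance and the margin value of 0.2 are assumptions made for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pulls same-class pairs together and pushes others apart.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed, tunable value).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)
```

A loss of this kind separates classes in feature space, which is consistent with the limitation noted above: it does not by itself point to which subtle visual cues distinguish two confusing images.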

Data augmentation
Whatever method is used, more meaningful training data generally helps to train a more general model [35]. A common approach is to use search engines, crawlers, etc. to search for relevant images and text [36] on the Internet, and to use them to train the fine-grained classification model. However, there is a huge amount of noise in such data [37], and techniques are required to suppress this noise and extract valid information. Hu et al. [38] proposed a weakly supervised data augmentation network (WS-DAN) to augment images guided by attention maps generated by weakly supervised learning. Our dataset ensures data diversity by collecting more samples from real life.

Datasets for fine-grained classification
To help develop and assess fine-grained classification technology, researchers have released many public fine-grained classification datasets. In addition to the aforementioned CUB200-2011 dataset [10], there are Stanford Cars [39], FGVC Aircraft [40], Oxford 102 Flowers [41], and other datasets.
Stanford Dogs is a public fine-grained classification dataset for dog breeds [5]. It contains 20,580 images of 120 dog breeds, with 150-252 images for each breed. The images in this dataset are clear, with the dog plainly visible; for each dog, a whole-body bounding box is annotated.
Other dog datasets have also been provided for classification tasks. For example, ImageNet-1K [42] contains about 116,000 pictures of 117 dog breeds. Some general datasets also contain dog images, but as a single category, without any fine-grained classification information. For example, there are 2079 dog bounding boxes in the VOC dataset training data (2007 and 2012) [43] and 530 images containing dogs in the validation data; in the COCO [44] dataset, there are 5508 bounding boxes of dogs. Our proposed dataset focuses on fine-grained dog classification, and provides sufficient diversity for each breed to test deep neural model generalization.

Tsinghua Dogs Dataset
We now introduce how we constructed our Tsinghua Dogs Dataset and present its statistical features.

Data collection
Our data capture system has collected more than 100,000 images of dogs captured and uploaded by users in three Chinese cities. We removed sensitive information from the data and selected more than 46,000 images to build the dataset. As the number of images of each breed of dog reflects the actual distribution in these three cities, there is a long tail to this data. Teddy dog pictures are the most frequent (7449 images), while Cassell pictures are the least frequent (4 images). See Fig. 3.
While this reflects the real distribution of dog breeds, to make the dataset friendly to algorithms, we wished to ensure that each breed has no fewer than 200 pictures, so that the images for each breed are sufficiently diverse. Therefore, we also added data from the Stanford Dogs dataset and from an image search engine. We integrated 18,000 pictures with only one dog from Stanford Dogs into our dataset. We also crawled and manually selected more than 6000 pictures using Baidu image search to ensure that our dataset contained no fewer than 200 pictures per breed.
We removed duplicate images in the dataset by computing image structural similarity (SSIM) [45]. After collecting the data, we determined the true dog breed in the images through expert review. We also asked the annotators to filter out low-quality images, i.e., any that were seriously blurred, deformed, occluded, or where the dog was too small a part of the image. The final number of images in the Tsinghua Dogs Dataset is 70,428, from a total of 130 breeds, with no fewer than 200 images per breed. Some images from the dataset are shown in Fig. 4.
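The duplicate-removal step can be illustrated with a simplified sketch. It is not the pipeline used to build the dataset: it computes SSIM once over the whole image rather than over local windows as in [45], it assumes the images have already been converted to grayscale and resized to a common size, and the threshold of 0.95 and the function names are hypothetical.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    x, y: flat lists of pixel intensities (simplification: the standard index
    averages this quantity over local windows instead of using one window).
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("images must be non-empty and of equal size")
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def deduplicate(images, threshold=0.95):
    """Greedily keep the indices of images that are not near-duplicates.

    An image is dropped if its SSIM with any already kept image reaches
    `threshold` (an assumed value, to be tuned on real data).
    """
    kept = []
    for idx, img in enumerate(images):
        if all(global_ssim(img, images[k]) < threshold for k in kept):
            kept.append(idx)
    return kept
```

The greedy comparison is quadratic in the number of images, so at the scale of tens of thousands of images some form of pre-grouping (for example by breed or by image hash) would likely be needed in practice.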

Data annotation with active learning
Manual labeling using tools such as Labelme [46] is laborious and inefficient (annotating 300-500 pictures per hour). To reduce the effort of manual labeling, we used an active learning strategy to label the dataset in a semi-automatic manner, which can increase the efficiency to 1500-2500 pictures per hour. The approach was:
1. For 2000 randomly chosen pictures from the dataset, label tight bounding boxes around the dog's whole body and the dog's head (including the ears), as shown in Fig. 5.
2. Train the RetinaNet model and automatically generate 2000 new data annotations.
3. Manually correct labelings with low confidence using our own adjustment tool (see Fig. 6), which needs only keyboard interaction; insert those data into the training data, and return to step 2. Repeat until all the data are labeled.
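The three steps above can be sketched as a loop. This is a schematic outline under stated assumptions, not the annotation tool itself: `label_manually`, `train_detector` and `correct_manually` are hypothetical callables standing in for the human annotator, detector training and the keyboard-based adjustment tool; the confidence threshold of 0.8 is an assumed value; and the sketch adds every annotation from a batch (auto-accepted or corrected) to the training set.

```python
import random


def annotate_with_active_learning(images, label_manually, train_detector,
                                  correct_manually, batch_size=2000,
                                  confidence_threshold=0.8, seed=0):
    """Semi-automatic annotation loop; returns {image: boxes}.

    train_detector(labeled) must return a function mapping an image to
    (boxes, confidence). All callables are placeholders for real components.
    """
    rng = random.Random(seed)
    pending = list(images)
    rng.shuffle(pending)

    # Step 1: manually label an initial random batch.
    first, pending = pending[:batch_size], pending[batch_size:]
    labeled = {img: label_manually(img) for img in first}

    while pending:
        # Step 2: retrain on everything labeled so far, then annotate a new batch.
        detector = train_detector(labeled)
        batch, pending = pending[:batch_size], pending[batch_size:]
        for img in batch:
            boxes, confidence = detector(img)
            # Step 3: only low-confidence predictions go to a human for correction.
            if confidence < confidence_threshold:
                boxes = correct_manually(img, boxes)
            labeled[img] = boxes
    return labeled
```

Because the detector is retrained after each batch, the share of predictions needing correction would be expected to shrink over time, which is where the throughput gain over fully manual labeling comes from.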

Statistics
Using the above pipeline, 70,428 dog images were annotated for our Tsinghua Dogs dataset, including about 46,000 images of dogs taken in Chinese cities, 18,000 images from the Stanford Dogs dataset, and 6000 images downloaded from Baidu, Google and other image search engines. The total number of dog breeds is 130. Each image contains a single dog, annotated with bounding boxes of the dog's head and the whole dog (see Fig. 7). We compare our dataset, CUB-200, and Stanford Dogs in Table 1.
We now give some statistics for our dataset. The number of dogs for each breed varies from 200 to 7449 (Teddy dogs). Figure 8 shows the numbers of images for the 24 most common breeds in the dataset. Statistics on the fraction of the whole image covered by the bounding box of the dog's head are given in Fig. 9, while the fraction of the whole image covered by the bounding box of the dog's body is indicated in Fig. 10.
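The coverage statistic behind Figs. 9 and 10 is the ratio of bounding-box area to image area. A minimal sketch follows; the `(x_min, y_min, x_max, y_max)` pixel format is an assumption for illustration and may differ from the annotation format distributed with the dataset.

```python
def bbox_area_fraction(box, image_width, image_height):
    """Fraction of the image area covered by an axis-aligned bounding box.

    box: (x_min, y_min, x_max, y_max) in pixels (assumed format).
    The box is clipped to the image so the result always lies in [0, 1].
    """
    x_min, y_min, x_max, y_max = box
    x_min, x_max = max(0, x_min), min(image_width, x_max)
    y_min, y_max = max(0, y_min), min(image_height, y_max)
    width = max(0, x_max - x_min)
    height = max(0, y_max - y_min)
    return (width * height) / (image_width * image_height)
```

For example, a 200 × 200 head box in a 400 × 500 image covers 0.2 of the image.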
The images do not have a fixed resolution. Very few pictures have a length or width less than 100 pixels, with a minimum of 60, and most images have a relatively high resolution. Image resolution statistics are shown in Fig. 11. At least half of our images have higher resolution than those in the Stanford Dogs dataset.

Benchmarking using Tsinghua Dogs
Although most deep neural fine-grained classification models can be retrained on dog datasets, as has been done for Stanford Dogs [5], these methods are usually not optimized for images of dogs. Furthermore, we argue that the diversity within current dog datasets does not provide an adequate test. In this section, we first discuss training procedures for several models we have benchmarked on our dataset. We then show benchmarking results using our dataset and analyze how the additional diversity in our dataset improves the robustness of fine-grained classification models.
We took PMG [19] as our base model, and used ResNet50 as the backbone network. Parameter settings strictly adhered to those in the original paper. The PMG model starts from the bottom stage network and trains the network stage-by-stage. Each stage is trained with images spliced from image patches of the size specified in the original paper. The experiment used a learning rate of 0.002 for the newly added stage, with a cosine annealing schedule to reduce the learning rate. We trained for 200 epochs on each dataset. The input was a 448 × 448 image cropped from the center after scaling the original image to 550 × 550. The batch size was 16.
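The learning-rate schedule and cropping geometry described for PMG can be written out explicitly. The sketch below uses the standard cosine annealing formula with a minimum learning rate of zero, which is an assumption (the text specifies only the initial rate of 0.002 and the 200-epoch budget); the function names are illustrative.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=200, base_lr=0.002, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 down to min_lr at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


def center_crop_box(scaled_size=550, crop_size=448):
    """Pixel box (left, top, right, bottom) of a centered square crop."""
    offset = (scaled_size - crop_size) // 2
    return (offset, offset, offset + crop_size, offset + crop_size)
```

With these settings the learning rate falls to half its initial value at epoch 100, and the crop discards a 51-pixel border on each side of the 550 × 550 image.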
The backbone network of TBMSL-Net is also ResNet50. TBMSL-Net can automatically learn the location and the key parts of an object in an input image. Its final fine-grained classification score is given by combining whole graph features. We used the same algorithm window and other parameter settings as for CUB. Both the object and the original image were resized to 448 × 448, but the image of the key part of the object was resized to 224 × 224. The optimizer used was stochastic gradient descent (SGD). The momentum was 0.9, and the weight decay was 0.0001. The initial learning rate was 0.001; it was multiplied by 0.1 after 60 epochs. We trained for 200 epochs in total.
WS-DAN [38] improves the performance of image classification through two mechanisms: one extracts significant features from the image to make the image appearance more effective; the other focuses on the location of the target so that the model can observe the target more "closely" to improve performance. The size of the input images was 512 × 448, and training ran for 80 epochs. SGD optimization was used with a momentum of 0.9 and weight decay of 0.00001. The initial learning rate was 0.001, with exponential decay of 0.9 after every 2 epochs.
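The WS-DAN schedule amounts to a staircase exponential decay. A short sketch, assuming the decay is applied as a step at the end of every second epoch (the text does not say whether it is stepped or continuous):

```python
def staircase_exponential_lr(epoch, base_lr=0.001, decay=0.9, step=2):
    """Learning rate multiplied by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Under this reading, the learning rate after the full 80 epochs is 0.001 × 0.9^40, a little under 1.5 × 10^-5.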
For Inception V3 we used the training settings from Ref. [42]. Each image was resized to 224 × 224, and training ran for 200 epochs. The initial learning rate was 0.05, and it was adjusted as follows: if the accuracy on the validation set did not increase for 10 epochs, the learning rate was multiplied by 0.1. The optimizer was again SGD (momentum = 0.9) with a batch size of 64. The penultimate layer used a dropout of 0.4.
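The plateau rule for Inception V3 can be sketched as a small stateful scheduler. This is an illustrative reading of the rule, not the training code used in the experiments; the class name and the choice to reset the counter after each reduction are assumptions.

```python
class PlateauSchedule:
    """Multiply the learning rate by `factor` after `patience` epochs
    without an improvement in validation accuracy."""

    def __init__(self, lr=0.05, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy and return the current rate."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0  # assumed: start counting afresh
        return self.lr
```

Calling `step` once per epoch with the measured validation accuracy keeps the rate at 0.05 while accuracy keeps improving and drops it to 0.005 after ten consecutive epochs without improvement.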
We split the CUB 200-2011 [10] data into training and validation sets according to train_test_split.txt. The numbers of images in the training and validation sets are 5994 and 5794, respectively. Stanford Dogs has 12,000 training images and 8580 validation images. Our dataset also provides labels for training and validation (randomly selecting 40 images for each breed for validation), with 65,228 and 5200 images respectively.
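The per-breed hold-out can be reproduced in a few lines. The sketch below is illustrative only: the official split is the one distributed with the dataset, and the `(image_id, breed)` sample format, the seed and the function name are assumptions.

```python
import random
from collections import defaultdict


def split_per_breed(samples, val_per_breed=40, seed=0):
    """Hold out a fixed number of randomly chosen images per breed.

    samples: iterable of (image_id, breed) pairs (assumed format).
    Returns (train, val) lists of (image_id, breed) pairs.
    """
    by_breed = defaultdict(list)
    for image_id, breed in samples:
        by_breed[breed].append(image_id)

    rng = random.Random(seed)
    train, val = [], []
    for breed in sorted(by_breed):  # sorted for a reproducible order
        ids = list(by_breed[breed])
        rng.shuffle(ids)
        val.extend((i, breed) for i in ids[:val_per_breed])
        train.extend((i, breed) for i in ids[val_per_breed:])
    return train, val
```

With 130 breeds this yields 130 × 40 = 5200 validation images, leaving 70,428 − 5200 = 65,228 for training, which matches the counts given above. Because the hold-out is a fixed number per breed, the validation set is balanced across breeds while the training set keeps the long-tailed distribution.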

Results and analysis
Various deep neural classification models have reported their performance on Stanford Dogs. Inception V3 achieved an accuracy of 88.9%, while WS-DAN ranked first with an accuracy of 92.2%. However, these models are not optimized for fine-grained classification of dogs, and performance would degrade in real-world applications.

Results
Although PMG [19] reported its classification accuracy on three fine-grained classification datasets, CUB 200-2011 [10], Stanford Cars [39], and FGVC-Aircraft [40], the model has not been tested on Stanford Dogs. To ensure a fair and effective comparison, we first tested the PMG model on CUB to verify that our trained PMG model gave results consistent with the original paper. Then we trained the PMG model on Stanford Dogs and our dataset with the same training parameters. The performance of the PMG model on these datasets is shown in Table 2. As can be seen clearly from the comparison, the accuracy of the PMG model on our dataset is lower than on Stanford Dogs by about 3%, demonstrating that our dataset presents a greater challenge for fine-grained dog classification.
We also benchmarked the accuracy of the deep neural networks described above on our Tsinghua Dogs dataset: see Table 3. Notice that the accuracy of Inception V3 drops by more than 10% from Stanford Dogs to Tsinghua Dogs, while that of WS-DAN decreases by over 5%. These results imply that current state-of-the-art fine-grained models still have considerable room for improvement.

Further analysis
To better understand how the diversity of our new dog dataset improves the robustness of classification, we now consider a qualitative analysis of the classification results. Figure 12 shows qualitative classification results for seven breeds. For dogs in columns 2-5, the model trained on Tsinghua Dogs succeeded in finding the right breed, while the model trained on Stanford Dogs failed. Real-world dogs are captured from various directions, giving a wide variation in appearance even for the same breed. Our Tsinghua Dogs dataset incorporates more diversity and is thus more suitable for developing generalized deep neural models for fine-grained dog classification.

Conclusions
This paper has introduced a new, challenging fine-grained classification dataset for dogs, Tsinghua Dogs. Our dataset contains 130 dog breeds and 70,428 images, with bounding boxes annotated for the locations of the dog and its head. The diversity of the dataset and its additional annotation allow the construction of the more robust and accurate deep neural fine-grained models needed for real-world applications.

Fig. 1 Dog variations in our dog dataset. (a) Great Danes exhibit large variations in appearance, while (b) Norwich terriers and (c) Australian terriers are quite similar to each other.

Fig. 8 Top 24 breeds of dogs by number of images.

Fig. 9 Fraction of the image covered by the dog's head bounding box.

Fig. 10 Fraction of the image covered by the dog's body bounding box.

Fig. 12 Qualitative comparison of WS-DAN models trained on Stanford Dogs and Tsinghua Dogs. Dogs in each row belong to the same breed. WS-DAN trained on Tsinghua Dogs classifies the dogs correctly except for the last column, while the one trained on Stanford Dogs gives a correct classification only for the first column.

Table 1 Dataset comparison