In this section, the Date Estimation in the Wild dataset (Sect. 3.1) and the two proposed baseline approaches to predict the acquisition year of images (Sect. 3.2) are described in detail.
3.1 Image Date Estimation in the Wild Dataset
The Flickr API was utilized to download images for each year of the period from 1930 to 1999. We observed that many historical images are supplemented with time information, either in the title or in the related tags and descriptions. Therefore, we used the respective year as an additional query term to reduce the number of “spam” images. The only filtering we applied was restricting the search to photos. As a consequence, the dataset is noisy since it contains, for example, close-ups of plants or animals as well as historical documents. In order to avoid a bias towards more recent images, the maximum number of images per year was limited to 25000. Finally, the dataset consists of 1029710 images with a high diversity of concepts. The granularity \(g \in \{0,4,6,8\}\) of the date entry according to the Flickr annotation is stored as well. The distribution of images per year and the related granularity of dates are depicted in Fig. 2.
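For illustration, the per-year query can be sketched as follows. This is a minimal example assuming the Python flickrapi package and a valid API key; it is not the exact crawler used to build the dataset, and the credentials and paging are placeholders.

```python
import flickrapi

API_KEY = "YOUR_API_KEY"        # placeholder credentials (assumption)
API_SECRET = "YOUR_API_SECRET"

flickr = flickrapi.FlickrAPI(API_KEY, API_SECRET, format="parsed-json")

def search_year(year, page=1, per_page=500):
    """Query photos whose 'taken' date falls in the given year and whose
    metadata mentions the year as free text."""
    return flickr.photos.search(
        text=str(year),                    # year used as additional query term
        min_taken_date=f"{year}-01-01",
        max_taken_date=f"{year}-12-31",
        media="photos",                    # restrict the search to photos
        extras="date_taken,url_o",         # request taken date and original URL
        per_page=per_page,
        page=page,
    )

# Example: fetch the first result page for every year of the covered period.
for year in range(1930, 2000):
    photos = search_year(year)["photos"]["photo"]
```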
In order to obtain reliable validation and test sets that match the dataset distribution, a maximum of 75 unique images for the years 1930 to 1954 and 150 unique images for the remaining years were extracted. A unique image is defined as an image with a date granularity of \(g=0\) (Y-m-d H:i:s) or \(g=4\) (Y-m), for which no visual near-duplicates (detected by comparing the features from the last pooling layer of a GoogLeNet pre-trained on ImageNet) exist in the entire dataset. Subsequently, 8495 unique images were extracted for the validation set and another 16 per year were selected manually to obtain the test dataset containing 1120 images. The remaining 1020095 images constitute the training set. The dataset is available at https://doi.org/10.22000/0001abcde.
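The near-duplicate check can be sketched as follows; this is a minimal illustration assuming that the pooling-layer features of all images have already been extracted and L2-normalized, and the similarity threshold shown here is a hypothetical value, not the one used for the dataset.

```python
import numpy as np

def has_near_duplicate(query_feat, all_feats, threshold=0.95):
    """Return True if any image in the dataset is a visual near-duplicate
    of the query, measured by cosine similarity of GoogLeNet pooling features.

    query_feat: (d,) L2-normalized feature of the candidate image
    all_feats:  (n, d) L2-normalized features of the remaining dataset
    threshold:  similarity above which two images count as near-duplicates
                (hypothetical value, chosen for illustration)
    """
    similarities = all_feats @ query_feat   # cosine similarity via dot product
    return bool(np.any(similarities >= threshold))
```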
3.2 Baseline Approaches
Two baseline approaches are realized by training a GoogLeNet [11] and treating image date estimation as a classification or regression problem, respectively.
Convolutional neural networks require many images per class c to learn appropriate models for the classification task. However, the dataset lacks images for the first three decades (Fig. 2). For this reason, we decided to use \(\left| c\right| = 14\) classes by quantizing the image acquisition year into 5-year periods, which reduces the classification complexity while still maintaining a good temporal resolution. For the classification task, GoogLeNet was trained using Caffe on a model pre-trained on ImageNet [8]. We randomly selected 128 images per batch for training, which were scaled by the ratio \({256}/{\min (w,h)}\) (w and h are the image width and height). To augment the training data, the images were horizontally flipped and cropped randomly to fit the network input size of \(224\times 224\times 3\) pixels. The stochastic gradient descent algorithm was employed for 1M iterations with a momentum of 0.9 and a base learning rate of 0.001 to reduce the classification loss. The weights of the fully connected (fc) layers were re-initialized and their corresponding learning rates were multiplied by 10. The output size of the fc layers was set to the number of classes, and the learning rates were reduced by a factor of 2 every 100k iterations.
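The mapping from acquisition year to one of the 14 class labels, together with the scaling and random cropping used for augmentation, can be sketched as follows. This is a minimal illustration assuming images are handled with PIL and NumPy; it is not the actual Caffe data layer used for training.

```python
import random
import numpy as np
from PIL import Image

def year_to_class(year):
    """Quantize the acquisition year (1930-1999) into one of 14 five-year classes."""
    return (year - 1930) // 5           # 1930-1934 -> 0, ..., 1995-1999 -> 13

def augment(img):
    """Scale the shorter side to 256 pixels, flip horizontally at random,
    and take a random 224x224 crop as network input."""
    w, h = img.size
    scale = 256 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    x = random.randint(0, img.width - 224)
    y = random.randint(0, img.height - 224)
    crop = img.crop((x, y, x + 224, y + 224))
    return np.asarray(crop, dtype=np.float32)   # 224 x 224 x 3 input
```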
Test images are scaled by the ratio \({224}/{\min (w,h)}\), and three \(224 \times 224\) pixel regions, chosen depending on the image's orientation, are passed to the trained model. To estimate a specific acquisition year \(y_E\), the averaged class probabilities p(c) of the three crops for each class \(c \in \{0, \dots, 13\}\) are interpolated by:
$$\begin{aligned} y_E = 1930 + \left\lfloor 0.5 + \frac{1999 - 1930}{\left| c\right| - 1} \cdot \sum _{i = 0}^{\left| c\right| - 1}{i \cdot p(i)}\right\rfloor , \quad \text {with} \quad \sum _{i = 0}^{\left| c\right| - 1}{p(i)} = 1. \end{aligned}$$
(1)
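The interpolation in Eq. (1) can be written in a few lines; the following sketch assumes that the softmax outputs of the three test crops are given as NumPy arrays of length 14.

```python
import numpy as np

def estimate_year(crop_probs, y_min=1930, y_max=1999):
    """Estimate the acquisition year from the class probabilities of the
    three test crops according to Eq. (1)."""
    p = np.mean(crop_probs, axis=0)           # average probabilities over the crops
    p = p / p.sum()                           # ensure the probabilities sum to 1
    num_classes = len(p)                      # |c| = 14
    expectation = np.sum(np.arange(num_classes) * p)
    return y_min + int(np.floor(0.5 + (y_max - y_min) / (num_classes - 1) * expectation))

# Example: three crops that all strongly vote for class 9 (years 1975-1979)
probs = np.full((3, 14), 0.01)
probs[:, 9] = 0.87
print(estimate_year(probs))
```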
For the regression task, the Euclidean loss between the predicted and the ground-truth image date was minimized. We used the same learning parameters as for classification, except that the base learning rate was reduced to 0.0001 and a bias of 1975 (the middle year of the covered period) was used for the fc layers to stabilize training. Finally, the output size was set to 1 so that the year is predicted directly.
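The role of the single-output layer and its bias can be illustrated with the following sketch; it assumes a generic fully connected regression head rather than the actual Caffe layer definition, and the weight initialization shown is an assumption.

```python
import numpy as np

def init_regression_head(feature_dim, bias_year=1975.0):
    """Single-output fc layer; the bias starts at the middle year so that
    early predictions already lie in a plausible range."""
    weights = np.random.normal(0.0, 0.01, size=feature_dim)  # assumed initialization
    bias = bias_year
    return weights, bias

def euclidean_loss(features, weights, bias, true_year):
    """Euclidean loss between the predicted and the ground-truth year."""
    predicted_year = features @ weights + bias
    return 0.5 * (predicted_year - true_year) ** 2
```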