Through a set of experiments, we investigated the effectiveness of FractalDB and how to construct categories under the configuration effects mentioned in Sect. 3.3. We then quantitatively evaluated and compared the proposed framework with Supervised Learning (ImageNet-1k and Places-365, namely the ImageNet (Deng et al. 2008) and Places (Zhou et al. 2017) pre-trained models) and SSL (DeepCluster-10k (Caron et al. 2018)) on several datasets (Krizhevsky 2009; Deng et al. 2008; Zhou et al. 2017; Everingham et al. 2015; Lake et al. 2015). Among SSL methods, we used DeepCluster-10k because it is the most similar to the proposed method from the perspective of pseudo labels generated by a specific function: in DeepCluster-10k, k-means clustering is applied to convolutional features to create labels.
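As a point of reference, the following is a minimal sketch of the pseudo-labeling idea behind DeepCluster, not the authors' implementation (which uses PCA-reduced features and repeated rounds of re-clustering); the function and variable names are illustrative.

```python
# Minimal sketch of DeepCluster-style pseudo-labeling (illustrative only):
# cluster convolutional features with k-means and reuse the cluster
# indices as classification labels for the next training round.
import numpy as np
from sklearn.cluster import KMeans

def assign_pseudo_labels(conv_features: np.ndarray, n_clusters: int = 10000) -> np.ndarray:
    """conv_features: (num_images, feature_dim) array from a CNN backbone."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=1).fit(conv_features)
    return kmeans.labels_  # one pseudo-category index per image
```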
Implementation Details
To confirm the properties of FractalDB and compare our pre-trained features with previous studies, we principally used ResNet-50. Several architectures, such as AlexNet and the ResNet family, are investigated in Table 9; all other experiments use ResNet-50 only. We simply replaced the pre-training phase with our FractalDB (e.g., FractalDB-1k/10k) without changing the fine-tuning step. For the fine-tuning datasets, we followed the standard training/validation splits. For pre-training and fine-tuning, we used momentum stochastic gradient descent (SGD) (Bottou 2010) with a momentum of 0.9, a batch size of 256, and an initial learning rate of 0.01. The learning rate was multiplied by 0.1 at epochs 30 and 60, and training was performed up to epoch 90. The input images were cropped to \(224\times 224\) [pixel] from a \(256\times 256\) [pixel] input image. We implemented only random cropping as data augmentation, since our goal is to evaluate the potential of FractalDB pre-training in a simple manner.
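For concreteness, a minimal PyTorch sketch of this recipe is given below; dataset loading is elided, and all names are placeholders rather than the authors' released code.

```python
# Sketch of the training recipe above: SGD with momentum 0.9, batch size
# 256, lr 0.01 decayed x0.1 at epochs 30/60, 90 epochs, and random
# 224x224 crops from 256x256 inputs as the only augmentation.
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

model = resnet50(num_classes=1000)  # e.g., pre-training on FractalDB-1k
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

transform = T.Compose([
    T.Resize(256),
    T.RandomCrop(224),  # random cropping is the only augmentation used
    T.ToTensor(),
])

for epoch in range(90):
    # ... iterate over the (FractalDB or fine-tuning) loader and update ...
    scheduler.step()
```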
Tunings and Comparisons
We explored the configuration of formula-driven image datasets for fractal generation by comparing models trained on variously configured FractalDBs. We evaluated their performance on the CIFAR-10/100 (C10, C100), ImageNet-100 (IN100), and Places-30 (P30) datasets. Considering the computational resources, we used IN100 and P30 in place of the ImageNet-1k and Places-365 datasets, randomly selecting 100/30 categories from them. The parameters correspond to those mentioned in Sect. 3.3. Additionally, we compared the best practice in FractalDB pre-training with related pre-training on representative datasets.
#Category and #Instance
In Figs. 5a–d, we plot the performance of FractalDBs configured with various numbers of categories and instances to investigate their effects. We varied both properties over {16, 32, 64, 128, 256, 512, 1000} and find that larger values tend to be better: a larger parameter in pre-training improves the fine-tuning accuracy on all the datasets. On C10/100, we observe +7.9/+16.0 increases in performance as #category grows from 16 to 1,000. Performance improvements are also discernible as #instance per category increases, albeit to a lesser extent: +5.2/+8.9 on C10/100.
Hereafter, we used 1,000 [category] \(\times \) 1,000 [instance] as the basic dataset size and additionally trained with 10k categories, since the #category parameter is more effective in improving performance.
Patch versus Point
In Table 1, we investigate the effects of differently sized filters in the generation process, i.e., \(3 \times 3\) [pixel] patch rendering versus \(1 \times 1\) [pixel] point rendering. The \(3 \times 3\) [pixel] patch rendering is better for pre-training, with 92.1 vs. 87.4 (+4.7) on C10 and 72.0 vs. 66.1 (+5.9) on C100. Moreover, when comparing random patch patterns to a fixed patch in image rendering, performance rates increased by {+0.8, +1.6, +1.1, +1.8} on {C10, C100, IN100, P30}.
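To illustrate the distinction, here is a hedged NumPy sketch of point versus patch rendering; the function name and exact rasterization details are our assumptions, not the paper's code.

```python
import numpy as np

def render_points(points, size=362, patch=None):
    """Rasterize normalized 2-D fractal points in [0, 1]^2; `patch` is an
    optional 3x3 binary pattern stamped at each point (None = 1x1 point)."""
    img = np.zeros((size, size), dtype=np.uint8)
    xy = np.clip((points * (size - 1)).astype(int), 1, size - 2)
    for x, y in xy:
        if patch is None:
            img[y, x] = 255                       # 1x1 point rendering
        else:
            img[y-1:y+2, x-1:x+2] |= patch * 255  # 3x3 patch rendering
    return img

# Fixed patch: one pattern for all images; random patch: resample per image.
random_patch = np.random.randint(0, 2, (3, 3)).astype(np.uint8)
```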
Filling Rate
In Table 2, we investigate the effects of different filling rates. The top scores are 92.0, 80.5, and 75.5 with a filling rate of 0.10 on C10, IN100, and P30, respectively. Although there are no significant differences among {0.05, 0.10, 0.15}, a filling rate of 0.10 appears to be the best.
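The filling rate can be read as the fraction of filled pixels in a rendered image. A minimal sketch of how such a threshold could be used to accept fractal categories follows; the function names and the acceptance rule are our assumptions.

```python
import numpy as np

def filling_rate(img: np.ndarray) -> float:
    """Fraction of non-zero (filled) pixels in a rendered fractal image."""
    return float(np.count_nonzero(img)) / img.size

def accept_category(img: np.ndarray, threshold: float = 0.10) -> bool:
    """Keep an IFS parameter set as a category only if its rendered image
    is filled above the threshold (0.10 in the tuning above)."""
    return filling_rate(img) > threshold
```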
Weight of Intra-Category Fractals
In Table 3, we investigate the effects of intra-category variance by changing the weight interval. Starting from the basic parameters at an interval of 0.1, i.e., {0.8, 0.9, 1.0, 1.1, 1.2} (see Fig. 4), we varied the interval over 0.1, 0.2, 0.3, 0.4, and 0.5. For an interval of 0.5, we set {0.01, 0.5, 1.0, 1.5, 2.0} in order to avoid a weighting value of zero. A higher intra-category variance tends to provide higher accuracy. The accuracies on C10 were {92.1, 92.4, 92.4, 92.7, 91.8}, where 0.4 gives the highest performance rate (92.7) and 0.5 decreases the recognition rate (91.8). We therefore conclude an interval of 0.4 to be the best and used the weight values {0.2, 0.6, 1.0, 1.4, 1.8}.
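A small sketch of how the weight sets above can be generated from an interval, including the zero-avoiding special case for 0.5 (the helper name is ours):

```python
def weight_set(interval: float):
    """Five weight multipliers centered at 1.0; a zero value is replaced
    by 0.01 (the paper's interval-0.5 special case)."""
    weights = [1.0 + k * interval for k in (-2, -1, 0, 1, 2)]
    return [w if w > 0 else 0.01 for w in weights]

print(weight_set(0.4))  # [0.2, 0.6, 1.0, 1.4, 1.8] -> the adopted setting
print(weight_set(0.5))  # [0.01, 0.5, 1.0, 1.5, 2.0]
```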
#Dot
In Table 4, we investigate the effects of different numbers of dots by comparing 100k, 200k, and 400k dots. The best parameters are 100k on C10 (91.3), 200k on C100/P30 (71.0/74.8), and 400k on IN100 (80.0). Although a larger value is suitable for IN100, lower values tend to be better on C10, C100, and P30. We select 200k for the #dot parameter, balancing rendering speed and accuracy.
Image Size
In Table 5, we investigate the effects of different image sizes. The \(256 \times 256\) [pixel] and \(362 \times 362\) [pixel] settings perform similarly, e.g., 73.6 (256) vs. 73.2 (362) on C100. At a larger size, such as \(1024 \times 1024\), the fixed number of dots is spread sparsely over the image plane; therefore, the fractal image projection produces better results at \(256 \times 256\) and \(362 \times 362\) [pixel]. A larger image size combined with a larger #dot could represent the fractal geometry more clearly; however, due to the limitation of computational resources, we set the rendering image size to \(362 \times 362\).
Grayscale/Color Configuration
In Table 6, we compare grayscale and color configurations of FractalDB in pre-training. The results for color are slightly better, but the effect of the color property does not appear to be strong in the pre-training phase, e.g., 93.1 (w/ color) vs. 92.9 (w/o color) on C10.
Training Epoch
In Table 7, we explore three training schedules for FractalDB-1k: 90, 120, and 200 epochs in the pre-training phase. The results confirm that longer-term training (200 epochs) is more effective than shorter-term training with 90 or 120 epochs.
Best Practice in FractalDB Pre-trained Model
We further explored the set of parameters in the FractalDB pre-trained model. According to the results of the explorative study and additional tuning with parameter combinations, the highest accuracies occurred with #category (1,000/10,000), #instance (1,000), patch (fixed \(3 \times 3\) patch in an image), filling rate (0.2), weight of intra-category fractals (0.4), #dot (200k), image size (\(362 \times 362\)), color configuration (random color), and training epochs (200). The performance rates are shown in Table 8.
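For reference, these settings can be collected into a single configuration. The sketch below is illustrative only; the parameter names are ours, not the authors' code.

```python
# Best-practice FractalDB configuration as reported above (names are
# illustrative placeholders, not the authors' released configuration).
FRACTALDB_BEST_PRACTICE = {
    "num_categories": 10000,        # 1,000 for FractalDB-1k
    "num_instances": 1000,
    "patch_mode": "fixed_3x3",
    "filling_rate": 0.2,
    "intra_category_weight_interval": 0.4,
    "num_dots": 200_000,
    "image_size": (362, 362),
    "color": "random",
    "pretrain_epochs": 200,
}
```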
Comparison to Other Pre-trained Datasets
In Table 8, we compared training from scratch (random initialization), Places-30/365 (Zhou et al. 2017), ImageNet-100/1k (ILSVRC'12) (Deng et al. 2008), and FractalDB-1k/10k. Since the hyperparameters of representation learning configurations differ across publications, we implemented all frameworks with the same parameters for a fair comparison of the method (FractalDB-1k/10k) against the baselines (Scratch, DeepCluster-10k, Places-30/365, and ImageNet-100/1k). The hyperparameters are given in the implementation details above.
The proposed FractalDB pre-trained model recorded several good performance rates. We describe them below by comparing our Formula-driven Supervised Learning with training from scratch, Self-supervised Learning, and Supervised Learning.
Comparison to Training from Scratch
The FractalDB-1k/10k pre-trained models recorded much higher accuracies than models trained from scratch on relatively small-scale datasets (C10/100, VOC12, and OG). In the case of fine-tuning on large-scale datasets (ImageNet-1k/Places-365), the effect of pre-training was relatively small. However, in fine-tuning on Places-365, the FractalDB-10k pre-trained model improved the performance rate beyond even ImageNet-1k pre-training (FractalDB-10k 50.8 vs. ImageNet-1k 50.3).
Comparison to Self-Supervised Learning
We used DeepCluster-10k (Caron et al. 2018) for comparison with automatically generated image categories, where 10k denotes pre-training with 10k categories. We believe that auto-annotation with DeepCluster is the method most similar to our formula-driven image dataset: DeepCluster-10k also assigns the same category to images that have similar image patterns, based on k-means clustering. Our FractalDB-1k/10k pre-trained models outperformed DeepCluster-10k on five different datasets, e.g., FractalDB-10k 94.1 vs. DeepCluster-10k 89.9 on C10, and 77.3 vs. 66.9 on C100. Our method is thus superior to DeepCluster-10k, a self-supervised learning method for learning feature representations in image recognition.
Comparison to Supervised Learning
We compared four types of supervised pre-training, namely on the ImageNet-1k and Places-365 datasets and on their limited-category subsets ImageNet-100 and Places-30, where the numbers correspond to the number of categories. First, our FractalDB-10k surpassed the ImageNet-100/Places-30 pre-trained models on all fine-tuning datasets. The results show that our framework is more effective than pre-training with subsets of ImageNet-1k and Places-365.
We then compared against full supervised pre-training, which currently represents the most promising pre-training approach. Although our FractalDB-1k/10k is not superior in all settings, our method partially outperformed the ImageNet-1k pre-trained model on Places-365 (FractalDB-10k 50.8 vs. ImageNet-1k 50.3) and Omniglot (FractalDB-10k 29.2 vs. ImageNet-1k 17.5), and the Places-365 pre-trained model on CIFAR-100 (FractalDB-10k 77.3 vs. Places-365 76.9) and ImageNet (FractalDB-10k 71.5 vs. Places-365 71.4). The ImageNet-1k pre-trained model remains much better than our proposed method on fine-tuning datasets such as C100 and VOC12, since these datasets contain similar categories, such as animals and tools.
Comparison with Other Architecture Ablations
We further compared the proposed pre-trained models across several architectures. We assigned eight representative architectures, namely AlexNet, VGGNet-{16, 19}, ResNet-{18, 50, 152}, ResNeXt-101, and DenseNet-161. The results are shown in Table 9. However, we could not optimize the FractalDB pre-trained VGGNet-{16, 19} during the experiment; therefore, accuracies with VGGNet-{16, 19} are not included in the table.
Among the ResNet-family architectures (ResNets, ResNeXt-101, and DenseNet-161), we confirmed a similar tendency (Table 9). The FractalDB pre-trained models achieved the top accuracies on OG and Places-365 and better results on C100. The results on C10 suggest that the FractalDB pre-trained models increase the performance rates as the network deepens, from ResNet-18 to ResNet-152. On the other hand, the FractalDB pre-trained AlexNet also assists fine-tuning on the ImageNet-1k dataset; the gap between scratch and FractalDB pre-training was +2.5 pt (FractalDB-10k 59.0 vs. Scratch 56.5). According to the experiments on several CNN architectures, the proposed FractalDB is effective in the pre-training phase.
Explorative Study
We also validated the proposed framework in terms of (i) category assignment, (ii) convergence speed, (iii) freezing parameters in fine-tuning, (iv) comparison to other formula-driven image datasets, (v) model ensemble, (vi) recognized category analysis, and (vii) visualization of first convolutional filters and attention maps.
Category Assignment (see Fig. 6 and Table 10)
First, we validated whether optimization can be successfully performed on the proposed FractalDB. Figure 6 shows how the pre-training accuracy varies as a function of label noise, where we randomly replaced category labels; 0% and 100% noise indicate normal training and fully randomized training, respectively. According to the results on FractalDB-1k, a CNN model can successfully classify fractal images defined by iterated functions, and well-defined categories with a balanced pixel rate allow optimization on FractalDB. When fully randomized labels were assigned, the architecture could not classify any images and the loss value remained static (accuracy close to 0%). This confirms that the fractal categories are reliable enough to train the image patterns.
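A minimal sketch of this label-noise protocol, assuming labels are stored as an integer array (the helper name is ours):

```python
import numpy as np

def corrupt_labels(labels: np.ndarray, noise: float, num_classes: int,
                   seed: int = 0) -> np.ndarray:
    """Replace a fraction `noise` of category labels with random ones
    (noise=0.0 -> normal training, noise=1.0 -> fully randomized)."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    idx = rng.choice(len(labels), size=int(noise * len(labels)), replace=False)
    labels[idx] = rng.integers(0, num_classes, size=len(idx))
    return labels
```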
Moreover, we used DeepCluster-10k to automatically assign categories to FractalDB. Table 10 compares category assignment with DeepCluster-10k (k-means) and FractalDB-1k/10k (IFS). We confirm that DeepCluster-10k cannot successfully assign categories to fractal images: the gaps between the IFS and k-means assignments are {11.0, 20.3, 13.2} on {C10, C100, VOC12}. This clearly indicates that our category assignment in FDSL, based on the principle of IFS and the parameters in Eq. (2), works well compared to DeepCluster-10k.
Convergence Speed (see Fig. 2b)
The transition of accuracy with FractalDB pre-training is similar to that of the ImageNet pre-trained model and much faster than training from scratch with random parameters (Fig. 2b). We validated the convergence speed in fine-tuning on C10. Pre-training with FractalDB-1k accelerated convergence in fine-tuning to a degree similar to the ImageNet pre-trained model. In line with the findings on pre-training by He et al. (2019), FractalDB pre-training also promotes faster transfer learning on additional datasets.
Freezing Parameters in Fine-Tuning (see Table 11)
Although full-parameter fine-tuning is better, conv1 and conv2 acquire a highly accurate image representation (Table 11). Freezing the conv1 layer caused a \(-1.1\) (92.3 vs. 93.4) or \(-3.5\) (72.2 vs. 75.7) decrease from full fine-tuning on C10 and C100, respectively. Compared to the other results, such as conv1–4/5 freezing, the bottom layers tend to learn a better representation. Since FractalDB pre-training does not learn from natural images, fine-tuning with many frozen layers is not effective: the FractalDB pre-trained model must train its middle layers to acquire natural image representations in the fine-tuning phase.
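As an illustration, freezing the lower layers of a pre-trained ResNet-50 in PyTorch could look as follows; the checkpoint path is hypothetical, and mapping "conv1–2" to torchvision's stem plus the first residual stage is our assumption.

```python
import torch
from torchvision.models import resnet50

model = resnet50()
# Hypothetical FractalDB checkpoint path, for illustration only.
model.load_state_dict(torch.load("fractaldb_1k_resnet50.pth"))

for name, param in model.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1")):
        param.requires_grad = False  # frozen; only upper layers adapt

# Fine-tune only the remaining trainable parameters.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01, momentum=0.9)
```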
Comparison to Other Formula-Driven Image Datasets (see Table 12)
Thus far, the proposed FractalDB-1k/10k has been better than other formula-driven image datasets. We used Perlin noise (Perlin 2002) and Bezier curves (Farin 1993) to generate image patterns and their categories, in the same manner as the FractalDB dataset.
We confirmed that both Perlin noise and Bezier curves are also beneficial for building a pre-trained model, achieving better rates than training from scratch. However, the proposed FractalDB is better than these approaches (Table 12). For a fairer comparison, we cite formula-driven image datasets with a similar #category, namely FractalDB-1k (total #image: 1M), Bezier-1024 (1.024M), and Perlin-1296 (1.296M). The improvements are +3.0 (FractalDB-1k 93.4 vs. Perlin-1296 90.4) on C10, +4.6 (FractalDB-10k 75.7 vs. Perlin-1296 71.1) on C100, +3.0 (FractalDB-1k 82.7 vs. Perlin-1296 79.7) on IN100, and +1.7 (FractalDB-1k 75.9 vs. Perlin-1296 74.2) on P30.
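For intuition, a Bezier-curve category can be parameterized by its control points, analogous to the IFS parameters of a fractal category. The sketch below samples a cubic curve and is our illustration, not the BCDB generation code.

```python
import numpy as np

def bezier_curve(control_points: np.ndarray, n_samples: int = 1000) -> np.ndarray:
    """Sample a cubic Bezier curve from four 2-D control points; a fixed
    set of control points can define one image category."""
    t = np.linspace(0.0, 1.0, n_samples)[:, None]
    p0, p1, p2, p3 = control_points
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

points = bezier_curve(np.random.rand(4, 2))  # (n_samples, 2) in [0, 1]^2
```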
Ensemble Model (see Fig. 7)
The FractalDB pre-trained model helps to improve accuracy with a model ensemble as well as with a single model. Figure 7 shows the results for a 20-model ensemble with FractalDB-1k; the final accuracy reaches 94.7/79.3 on the C10/C100 datasets.
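Assuming simple softmax averaging (the paper does not detail the ensembling rule), a sketch of the inference step:

```python
import torch

@torch.no_grad()
def ensemble_predict(models, images):
    """Average the softmax outputs of independently fine-tuned models
    (N = 20 in Fig. 7) and take the argmax as the ensemble prediction."""
    probs = torch.stack([m(images).softmax(dim=1) for m in models])
    return probs.mean(dim=0).argmax(dim=1)
```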
Recognized Category Analysis (see Table 13)
We investigated which categories are better recognized by the FractalDB pre-trained model compared to the ImageNet pre-trained model. Table 13 shows the category names and classification rates. The FractalDB pre-trained model tends to be better when an image contains recursive patterns (e.g., a keyboard, maple trees).
Visualization of First Convolutional Filters (see Fig. 8a–e) and Attention Maps (see Fig. 8f)
We visualized the first convolutional filters and Grad-CAM (Selvaraju et al. 2017) attention maps with pre-trained ResNet-50. As seen for ImageNet-1k/Places-365/DeepCluster-10k (Fig. 8a, b, e) and FractalDB-1k/10k pre-training (Fig. 8c, d), our pre-trained models clearly generate feature representations different from those of conventional natural image datasets. Based on the experimental results, we confirmed that the proposed FractalDB successfully pre-trains a CNN model without any natural images, even though its convolutional basis filters differ from those of natural image pre-training with ImageNet-1k/DeepCluster-10k.
Using Grad-CAM, the pre-trained models fine-tuned on the C10 dataset generate attention heatmaps. According to the center-right and right columns in Fig. 8f, the FractalDB-1k/10k pre-trained models also attend to the objects.
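A minimal Grad-CAM sketch over the last convolutional block of a torchvision ResNet-50, following Selvaraju et al. (2017); the hook-based implementation here is ours, not the paper's visualization code.

```python
import torch
from torchvision.models import resnet50

model = resnet50(num_classes=10).eval()  # e.g., fine-tuned on C10
feats, grads = {}, {}
layer = model.layer4  # last convolutional stage
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

def grad_cam(image, class_idx):
    """Return a normalized class-activation heatmap for one image."""
    score = model(image.unsqueeze(0))[0, class_idx]
    model.zero_grad()
    score.backward()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP over gradients
    cam = torch.relu((weights * feats["a"]).sum(dim=1))  # weighted feature maps
    return cam / cam.max()
```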