Evolutionary convolutional neural network for image classification based on multi-objective genetic programming with leader–follower mechanism

As a popular research topic in the field of artificial intelligence in the last 2 years, evolutionary neural architecture search (ENAS) compensates for the fact that the construction of convolutional neural networks (CNNs) relies heavily on the prior knowledge of designers. Since its inception, a great deal of research has been devoted to improving its associated theories, giving rise to many related algorithms with good results. However, the existing algorithms still have some limitations, such as the fixed depth or width of the network, the pursuit of accuracy at the expense of computational resources, and the tendency to fall into local optima. In this article, a multi-objective genetic programming algorithm with a leader–follower evolution mechanism (LF-MOGP) is proposed, where a flexible encoding strategy with variable length and width based on Cartesian genetic programming is designed to represent the topology of CNNs. Furthermore, the leader–follower evolution mechanism is proposed to guide the evolution of the algorithm, with the external archive set composed of non-dominated solutions acting as the leader and an elite population, updated after the external archive, acting as the follower. This mechanism increases the speed of population convergence, guarantees the diversity of individuals, and greatly reduces the computational resources required. The proposed LF-MOGP algorithm is evaluated on eight widely used image classification tasks and a real industrial task. Experimental results show that the proposed LF-MOGP is comparable with or even superior to 35 existing algorithms (including some state-of-the-art algorithms) in terms of classification error and number of parameters.


Introduction
The convolutional neural network (CNN) is a typical representative of neural network architectures. Due to its powerful feature extraction capability, it has shown excellent performance on a variety of vision-related tasks, including image classification [1,2], text detection [3], industrial data analysis [4], etc. In the past few years, researchers have conducted a variety of interesting studies on CNNs, including the design of architectures (depth, width, etc.) and the enhancement of learning capabilities (feature extraction and exploitation, propagation of loss functions, etc.). Representative CNNs include AlexNet [5], GoogLeNet [6], VGG [7], ResNet [8], DenseNet [9], etc., all of which have achieved remarkable results.
Although these CNNs have been designed with great success, the construction of CNN architectures is by no means an easy task. For machine learning scholars, a CNN is like a finely crafted work of art that requires constant tuning of the mechanisms and parameters of the network until a good combination, a suitable regularization scheme, and optimized parameters are found. This process depends on a wealth of prior knowledge and a huge amount of work, and many scholars have devoted considerable effort just to designing a satisfactory network architecture [10].
Fortunately, the potential of evolutionary algorithms (EAs) for neural architecture search (ENAS) has attracted considerable attention in the last few years and has yielded very promising results [11,12]. Systematic reviews of ENAS can be found in [13,14], where several representative algorithms are introduced, such as Genetic CNN and Evo-CNN (CNN architecture search based on genetic algorithms) [15,16], EAS and Meta-QNN (CNN architecture search based on Q-learning) [17,18], Large-scale Evolution [19], CGP-CNN (CNN architecture search based on Cartesian genetic programming) [20], NAS (CNN architecture search based on reinforcement learning) [21], and CNN-GA and AE-CNN (CNN architecture search based on genetic algorithms and block encoding) [22,23]. Experimental results have shown that these methods perform well in searching for optimal network structures. However, some important limitations still need to be addressed. First, some of these algorithms still require considerable expertise, e.g., the EAS algorithm starts from a primary network, and the selection of this primary network still requires considerable empirical knowledge. Second, the performance of some ENAS algorithms is highly dependent on computational resources, e.g., it takes 28 days to train NAS on CIFAR10 even with 800 GPUs. Third, some of these methods, such as Genetic CNN, employ an encoding strategy with a fixed length or a fixed width, which means that the depth or width of the CNNs is fixed. Since the performance of CNNs depends heavily on their depth and width, it is desirable that the architecture of CNNs be flexible and versatile to ensure good generalization ability across different tasks.
The purpose of this paper is to design an algorithm that can autonomously evolve neural networks for the task of image classification and that can overcome the limitations of the existing methods described above. In view of the excellent performance of genetic programming in many practical applications, the idea of the binary diagram of Cartesian genetic programming (CGP) [24] is introduced into this paper to encode the structure of CNN architectures. The main contributions of this paper are as follows: (1) A flexible coding strategy of variable length and width is proposed based on CGP, and 22 alternative function blocks are designed. The tree-like structure of CGP, which has fewer structural constraints, can represent the topology of CNNs well. Meanwhile, the efficient alternative function blocks make the algorithm more efficient and expand the search space, which in turn provides more possibilities for finding high-quality architectures. (2) A multi-objective genetic programming algorithm with a leader-follower mechanism is designed, where the external archive of non-dominated solutions and the elite population act as leader and follower, respectively. This mechanism can speed up the convergence while largely preventing the algorithm from getting trapped in a local optimum. (3) The effectiveness of the proposed algorithm is validated on eight benchmark datasets that are widely adopted in image classification tasks and on a real-world industrial dataset. Computational results illustrate the superior performance of the proposed algorithm over various state-of-the-art algorithms.
The rest of this paper is organized as follows. The next section introduces the background and related works on the design of CNNs. The details of the proposed LF-MOGP are presented in the subsequent section, followed by the experiment design used to evaluate the performance of the proposed algorithm. Then the results and analysis of the experiments are given. The penultimate section presents a case of LF-MOGP applied to a real-world industrial problem of slab number recognition. Finally, the conclusion and future works are presented.

Cartesian genetic programming algorithm
As an evolutionary computation technique, genetic programming (GP) is capable of automatically evolving models to solve real-world problems based on the principles of evolution and natural selection in the biological world [25]. It is well known for its flexibility and high interpretability compared to other evolutionary algorithms and is widely adopted in image classification, scheduling, and regression tasks [26]. As a highly representative branch of GP, CGP can flexibly encode various computing structures while avoiding the bloat problem in GP [27][28][29]. It is capable of achieving high robustness and generalization due to its remarkable self-organization, self-learning, and self-adaptive properties, and is more effective than other GP algorithms in many complex problems such as parameter optimization, scheduling, resource allocation, and complex network analysis [30,31].
The general form of CGP is shown in Fig. 1, where a program is represented by a directed graph with indexed nodes. In this form, there are n inputs and m outputs, and the output is obtained from the nodes of the last column. The size of the grid in Fig. 1 is n × c. Nodes in the same column cannot be connected to each other, and the connections between columns are further restricted by the level-back parameter (e.g., level-back = 3 means that a node in the ith column can connect back to column i − 3 at most). The red dashed path 'a-b-c-d' represents a simple neural network with five nodes.
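The level-back restriction can be illustrated with a small sketch. The node numbering below (inputs first, then grid nodes column by column) and the rule that program inputs are always reachable are assumptions taken from the usual CGP formulation, not details given in this paper; `allowed_sources` is a hypothetical helper name.

```python
def allowed_sources(col, rows, n_inputs, level_back):
    """Return the node indices that a node in column `col` may read from.

    Assumed numbering: inputs are 0..n_inputs-1, then grid nodes are
    numbered column by column with `rows` nodes per column.
    """
    # Assumption: program inputs are always reachable, as in standard CGP.
    sources = list(range(n_inputs))
    # Only the `level_back` columns immediately to the left are reachable;
    # nodes in the same column are never included.
    first_col = max(0, col - level_back)
    for c in range(first_col, col):
        start = n_inputs + c * rows
        sources.extend(range(start, start + rows))
    return sources


# A node in column 4 of a 2-row grid with level-back = 3 sees columns 1..3.
print(allowed_sources(col=4, rows=2, n_inputs=1, level_back=3))
```

With level-back equal to the number of columns, every earlier column becomes reachable, which is the least constrained setting.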

ResNet block
The creation of ResNet is a landmark event in the development of CNNs, which has made extraordinary contributions to mitigating vanishing gradients and to reusing features. The success of ResNet is mainly attributed to the design of its building blocks, especially the shortcut (skip) connections. During forward propagation, the skip connection allows the input signal to propagate directly from any lower level to a higher level, and the gradient of the loss can likewise flow back without being attenuated by the intermediate weight matrices. When the input and output of a skip connection have the same dimensions, they can be added directly without any other operation. If the two dimensions are inconsistent, the smaller one is expanded by zero-padding or by a 1 × 1 convolution to make the input and output dimensions consistent. Figure 2 shows a typical example of a ResNet block consisting of three convolution layers and a skip connection, where the line chart shows how the loss value propagates across the skip connection.
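As a minimal sketch of the zero-padding option for a skip connection, the toy example below treats a feature as a plain list of per-channel values at a single spatial position. The helper names are hypothetical, and the 1 × 1 convolution alternative is not shown.

```python
def zero_pad_channels(x, target_channels):
    """Append zero-valued channels until x has `target_channels` entries."""
    return x + [0.0] * (target_channels - len(x))


def residual_add(x, fx):
    """Add the shortcut input x to the block output fx.

    If the channel counts differ, the smaller one is zero-padded first,
    mirroring the zero-padding option described in the text.
    """
    width = max(len(x), len(fx))
    x_padded = zero_pad_channels(x, width)
    fx_padded = zero_pad_channels(fx, width)
    return [a + b for a, b in zip(x_padded, fx_padded)]


# The 2-channel input is padded to 4 channels before the addition.
print(residual_add([1.0, 2.0], [0.5, 0.5, 0.5, 0.5]))
```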

Algorithm overview with leader-and-follower mechanism
The framework of the proposed LF-MOGP is shown in Algorithm 1, which consists of the following steps. First, E, F_t, and P are initialized, where the initialization of the population based on the basic functional blocks is explained in detail in Algorithm 2, and the basic blocks designed in our algorithm are described in Sect. 3.2.
Then, the CNN corresponding to each solution in the population is constructed and evaluated separately. Two evaluation metrics are used here: the classification accuracy (Acc), which is maximized, and the complexity of the model, which is minimized. Their detailed descriptions can be found in Sect. 3.6.
After that, the dominance relationships among all the solutions are compared, and then the leader-follower mechanism comes into play. The flow diagram of the leader-follower mechanism is shown in Fig. 3 (the red line represents the leader E, the yellow circle represents the follower F_t, and the black arrow represents the update of the solution set). Specifically, the non-dominated solutions are stored in E, which acts as the leader during evolution. From the remaining solutions in P, K elite solutions are selected based on the crowding distance and stored in F_t, which acts as the follower during evolution. In the early stage of the algorithm, the parent solutions are mainly selected from the leader E, which helps to achieve fast convergence. In the later stage of the algorithm, the parent solutions are selected from both the leader E and the follower F_t, which improves the search diversity and thus prevents the algorithm from falling into a local optimum. In the main loop of the LF-MOGP algorithm, the leader and the follower are updated iteratively and collaborate to optimize the relevant metrics towards higher accuracy and lower model complexity, as shown in the third graph of Fig. 3, migrating to the upper right corner.
For the generation of new solutions, we design both mutation and crossover operators based on the characteristics of CGP encoding. The leader-follower mechanism allows the evolutionary process to focus on non-dominated solutions rather than the whole population at the early stage, which can greatly reduce the computational resource requirements and also speed up convergence. The newly generated solutions are also used to update the solutions in F_t, thus allowing F_t to follow the changes of E, which can improve the diversity of individuals in the later stage of the algorithm. That is, the leader-follower mechanism is able to achieve a better balance between exploration and exploitation while reducing computational resources.
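The split into leader and follower and the stage-dependent parent pool can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' Algorithm 1: each solution is reduced to a tuple `(accuracy, n_params)`, the crowding distance is the NSGA-II style one, the early stage draws parents only from E (the text says "mainly"), and the `switch_ratio` that separates the early and later stages is a hypothetical parameter.

```python
def dominates(a, b):
    """True if a dominates b; each is (accuracy, n_params)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def crowding_distances(points):
    """NSGA-II style crowding distance for a list of (accuracy, n_params)."""
    n = len(points)
    dist = [0.0] * n
    for m in range(2):  # one pass per objective
        order = sorted(range(n), key=lambda i: points[i][m])
        if not order:
            break
        lo, hi = points[order[0]][m], points[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = float("inf")  # keep boundary points
        if hi == lo:
            continue
        for r in range(1, n - 1):
            i = order[r]
            gap = points[order[r + 1]][m] - points[order[r - 1]][m]
            dist[i] += gap / (hi - lo)
    return dist


def split_leader_follower(population, k):
    """Leader E: non-dominated solutions. Follower F_t: k most spread-out others."""
    leader = [p for p in population
              if not any(dominates(q, p) for q in population)]
    rest = [p for p in population if p not in leader]
    dist = crowding_distances(rest)
    ranked = sorted(range(len(rest)), key=lambda i: dist[i], reverse=True)
    followers = [rest[i] for i in ranked[:k]]
    return leader, followers


def parent_pool(leader, followers, generation, max_generations, switch_ratio=0.5):
    """Early stage: parents come from E. Later stage: from E and F_t."""
    if generation < switch_ratio * max_generations:
        return list(leader)
    return list(leader) + list(followers)


population = [(0.90, 1.0), (0.95, 2.0), (0.85, 3.0), (0.80, 0.5), (0.70, 4.0)]
E, F_t = split_leader_follower(population, k=1)
print(E, F_t)
```

Restricting the early parent pool to E is what keeps the number of evaluated candidates small, while adding F_t later reintroduces dominated but diverse solutions.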

Encoding and decoding strategy
Since the optimal structure of a CNN (including width and depth) is unknown when dealing with a specific problem, the encoding strategy must allow the depth and width of the CNN to vary, so that the proposed LF-MOGP algorithm has the opportunity to find the optimal structure without being limited by the search space. CGP has exactly this flexibility, and each node function in CGP can easily be replaced with an efficient function block, which makes CGP a good representation of CNN structures. In addition, the CGP-based encoding strategy places fewer restrictions on the crossover and mutation operations. In particular, these operators do not impose the traditional restrictions on the length or width of the parent individuals, which further expands the search space and makes it possible to find the optimal structure of CNNs.
As mentioned above, each node in CGP is replaced with an efficient function block. To increase the efficiency of the algorithm, the ResNet block mentioned in Sect. 2.2 is introduced as a basic block. In addition, four other types of functional blocks, namely ConvBlock, Pooling, Concat, and Sum, are designed. These blocks contain several sub-blocks depending on their internal parameter settings. Table 1 shows the specific design of parameters in each functional block and its corresponding sub-blocks.
These functional blocks follow the naming rules below. Taking the ConvBlocks as an example, they are named C1 to C9 in increasing order of the number and size of convolution kernels. That is, the 9 ConvBlocks CB_32_1, CB_32_3, CB_32_5, CB_64_1, CB_64_3, CB_64_5, CB_128_1, CB_128_3, CB_128_5 are denoted as C1, C2,..., C9 in that order. ResNet Blocks are named in a similar way. The average and max pooling are denoted by P1 and P2, respectively. The names of the other function blocks are consistent with their symbols.
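The ConvBlock naming rule can be written out as a short sketch; `conv_block_names` is a hypothetical helper that only reproduces the C1 to C9 ordering stated above.

```python
def conv_block_names():
    """Map the short names C1..C9 to CB_<kernels>_<kernel size>."""
    names = {}
    index = 1
    for channels in (32, 64, 128):   # number of convolution kernels
        for kernel in (1, 3, 5):     # kernel size (kernel x kernel)
            names[f"C{index}"] = f"CB_{channels}_{kernel}"
            index += 1
    return names


names = conv_block_names()
print(names["C4"])  # the block with 64 kernels of size 1 x 1
```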
ConvBlock is used for feature extraction. The kernel size is chosen from {1 × 1, 3 × 3, 5 × 5}, the stride is set to 1, and the input is zero-padded before the convolution operation. To alleviate vanishing gradients during training, batch normalization is performed after each convolution.
ResNet Block uses standard convolutions. The kernel size is selected from {1 × 1, 3 × 3, 5 × 5}, with padding of half the kernel size. Each convolution is followed by batch normalization and a ReLU function. The specific form of the ResNet block is shown in Fig. 2 above.
Pooling includes max pooling and average pooling. The filter size is set to 2 × 2, and the stride is set to 2. Since pooling down-scales the features, the number of pooling layers is bounded by the size d × d of the input image, and at most log_2 d pooling blocks can be adopted.
Concat is designed to merge the feature maps at the channel level. If the two feature maps to be concatenated have the same number of rows and columns, they are merged directly at the channel level; otherwise, the larger feature map is down-sampled by max pooling so that the two feature maps have the same size. The final feature map F then has size $\min(M_1, M_2) \times \min(N_1, N_2) \times (C_1 + C_2)$, where $M_1 \times N_1 \times C_1$ and $M_2 \times N_2 \times C_2$ are the sizes of the two input feature maps.
Sum is designed to merge the feature maps at the pixel level. Similar to the Concat block, if the two feature maps to be summed differ in the number of rows or columns, the larger feature map is down-sampled by max pooling so that the two feature maps have the same size. In addition, if the two feature maps have different numbers of channels, the one with fewer channels is expanded with a 1 × 1 convolution so that they have the same dimension. The output feature map F then has size $\min(M_1, M_2) \times \min(N_1, N_2) \times \max(C_1, C_2)$, where $M_1 \times N_1 \times C_1$ and $M_2 \times N_2 \times C_2$ are the sizes of the two input feature maps.
Figure 4 illustrates how the genotype of a solution is decoded into its corresponding CNN architecture. As shown in Fig. 4, a solution consists of three items: node_id, is_active, and gene. Specifically, the gene of the first node is (C4, 0, 0), and according to the naming rules mentioned earlier, C4 indicates that a ConvBlock is selected, the number of convolution kernels is 64, and the size of the convolution kernel is 1 × 1 (see Table 1). The pooling block in the dotted box in the corresponding CNN architecture does not actually take effect, because the sixth value of is_active is False.
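The size rules of the Concat and Sum blocks and the role of is_active can be sketched as follows. This is an illustrative reading of the text, not the authors' decoder: shapes are `(rows, columns, channels)` tuples, node 0 is assumed to be the input image, and the last node is assumed to be the output node.

```python
def concat_shape(a, b):
    """Output size of Concat: smaller spatial size, channels added."""
    (m1, n1, c1), (m2, n2, c2) = a, b
    return (min(m1, m2), min(n1, n2), c1 + c2)


def sum_shape(a, b):
    """Output size of Sum: smaller spatial size, larger channel count."""
    (m1, n1, c1), (m2, n2, c2) = a, b
    return (min(m1, m2), min(n1, n2), max(c1, c2))


def mark_active(gene):
    """Mark the nodes that lie on a path from the input to the output.

    gene[j - 1] = (block, c_m, c_n) describes node j; id 0 is the input.
    Assumption: the last node is the output node.
    """
    is_active = [False] * len(gene)
    stack = [len(gene)]
    while stack:
        node = stack.pop()
        if node == 0 or is_active[node - 1]:
            continue
        is_active[node - 1] = True
        _, c_m, c_n = gene[node - 1]
        stack.extend((c_m, c_n))
    return is_active


gene = [("C4", 0, 0), ("P2", 1, 1), ("C2", 1, 1), ("Sum", 1, 3)]
print(mark_active(gene))  # node 2 (the pooling block) is not used
print(concat_shape((32, 32, 64), (16, 16, 32)))
print(sum_shape((32, 32, 64), (16, 16, 32)))
```

Inactive nodes stay in the genotype, so a later mutation or crossover can reconnect them without having to rebuild the block from scratch.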

Population initialization
The initialization procedure of the population is illustrated in Algorithm 2, where each solution is produced according to the employed encoding and decoding strategy.

Algorithm 2 Population Initialization
Input: …, minimum nodes N_l, maximum nodes N_u;
Output: Initialized population P;
P ← ∅;
n_p ← log_2 d; // Get the maximum number of pooling layers
for i = 1, · · · , N do
    n_pi ← 0;
    k ← Uniformly generate the number of nodes (depth of the network) in (N_l, N_u];
    gene ← Initialize an empty array of size (k, 3) to store the block type and connection IDs;
    is_active ← Initialize a boolean array of size k to store the activation type of the nodes;
    for j = 1, · · · , k do
        b ← Randomly choose a block from the alternative blocks;
        if b is POOLING BLOCK then
            n_pi ← n_pi + 1;
            if n_pi > n_p then
                b ← Reselect one block from remaining blocks;
            end
        end
        (c_m, c_n) ← Randomly generate connection ID from [0, j) for c_m and c_n respectively;
        gene_j ← Determine the jth value of gene (gene_j) based on b and (c_m, c_n);
    end
    S_i ← Get the solution by combining gene and is_active;
    P ← P ∪ S_i;
end
Return P

The generation of a solution consists of the following steps. First, the parameters and an empty solution are initialized. The next step is to select the type and connections for each block of the solution. Note that, since pooling is a down-sampling operation, the number of adopted pooling blocks must not exceed log_2 d [20]; otherwise, b must be re-selected from the remaining non-pooling blocks. Furthermore, each block can only be connected to nodes that precede it, that is, both c_m and c_n must be less than j. When the genotypes of all k nodes are determined, they are encoded into the corresponding blocks according to the encoding strategy described in Sect. 3.2. Finally, the solution S_i corresponding to a CNN is obtained by combining gene and is_active.
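A runnable sketch of Algorithm 2 is given below under several assumptions: the ResNet block names R1 to R9 are hypothetical (the text only says they are named in a similar way to the ConvBlocks), is_active is simply initialized to False and left to the decoder, and the function and argument names are ours.

```python
import math
import random

CONV_BLOCKS = [f"C{i}" for i in range(1, 10)]
RES_BLOCKS = [f"R{i}" for i in range(1, 10)]  # hypothetical names
POOL_BLOCKS = ["P1", "P2"]
MERGE_BLOCKS = ["Concat", "Sum"]
ALL_BLOCKS = CONV_BLOCKS + RES_BLOCKS + POOL_BLOCKS + MERGE_BLOCKS
NON_POOL_BLOCKS = [b for b in ALL_BLOCKS if b not in POOL_BLOCKS]


def init_solution(d, n_lower, n_upper, rng):
    """Create one random solution for a d x d input image."""
    max_pool = int(math.log2(d))           # at most log2(d) pooling blocks
    k = rng.randint(n_lower + 1, n_upper)  # number of nodes in (N_l, N_u]
    gene = []
    pool_count = 0
    for j in range(1, k + 1):
        block = rng.choice(ALL_BLOCKS)
        if block in POOL_BLOCKS:
            if pool_count >= max_pool:
                # Pooling budget used up: reselect among the other blocks.
                block = rng.choice(NON_POOL_BLOCKS)
            else:
                pool_count += 1
        # A node may only read from earlier nodes (id 0 is the input).
        c_m = rng.randrange(0, j)
        c_n = rng.randrange(0, j)
        gene.append((block, c_m, c_n))
    return {"gene": gene, "is_active": [False] * k}


def init_population(n, d, n_lower, n_upper, seed=None):
    """Create n random solutions; `seed` makes the run reproducible."""
    rng = random.Random(seed)
    return [init_solution(d, n_lower, n_upper, rng) for _ in range(n)]


population = init_population(n=30, d=32, n_lower=7, n_upper=30, seed=0)
print(len(population), len(population[0]["gene"]))
```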

Mutation
The mutation operation proposed in our method includes three types: Adding, Removing, and Modifying. The detailed process of the mutation operation is presented in Algorithm 3 and consists of the following steps. First, the number of nodes in the parent individual p is counted and one position is randomly selected for the mutation. Second, a mutation type is randomly selected from the three alternative types, and the mutation operation is performed. Note that this process may result in a dimension mismatch in the mutated offspring. To handle this case, a dimension-matching check and repair module is proposed to determine whether there is a mismatch in the number of channels or in the feature map size in the mutated offspring. If so, a 1 × 1 convolution is adopted to make the two channel dimensions consistent, and a max-pooling operation is used to down-sample the larger input so that the two inputs have the same size. Finally, the mutated offspring q is obtained and returned.
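The three mutation types can be sketched at the genotype level as follows. This is an illustration only, not Algorithm 3: the block list is shortened, the repair step shown here only clamps connection IDs so that every node still reads from earlier nodes, and the channel and feature-map repair described above (1 × 1 convolution, max pooling) is assumed to happen when the network is built.

```python
import random

BLOCKS = ["C1", "C5", "C9", "P1", "P2", "Concat", "Sum"]  # shortened list


def random_node(j, rng):
    """Random gene for the node at 1-based position j."""
    return (rng.choice(BLOCKS), rng.randrange(j), rng.randrange(j))


def repair_connections(gene):
    """Clamp connection IDs so that node j only reads from nodes 0..j-1."""
    return [(b, min(c_m, j - 1), min(c_n, j - 1))
            for j, (b, c_m, c_n) in enumerate(gene, start=1)]


def mutate(gene, rng, min_nodes=1):
    """Apply one of Adding, Removing, or Modifying at a random position."""
    child = list(gene)                      # the parent is left untouched
    pos = rng.randrange(len(child))
    kind = rng.choice(["add", "remove", "modify"])
    if kind == "remove" and len(child) <= min_nodes:
        kind = "modify"                     # never delete the last node
    if kind == "add":
        child.insert(pos, random_node(pos + 1, rng))
    elif kind == "remove":
        del child[pos]
    else:
        child[pos] = random_node(pos + 1, rng)
    return repair_connections(child)


rng = random.Random(1)
parent = [("C1", 0, 0), ("P1", 0, 1), ("C5", 1, 2), ("Sum", 2, 3)]
print(mutate(parent, rng))
```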

Crossover
Unlike the traditional crossover operation that requires two parent individuals to have the same length, in our algorithm the two parent individuals can have different lengths. Because there are fewer constraints, a larger search space is gained. The detailed procedure of this crossover operation is presented in Algorithm 4.
Specifically, the crossover operation consists of the following steps. First, a single-point crossover is performed based on two random crossover positions, one in each parent. After recombination, the new offspring are preliminarily generated. A simple example illustrating the crossover process is presented in Fig. 6. Next, a dimension-matching check is carried out and, if necessary, the dimensions are repaired in the same way as in the mutation operator. Finally, the offspring solutions generated by the crossover operation are obtained.

Algorithm 4 Crossover Operation of LF-MOGP
Input: genotype of parents P_1, P_2;
Output: Crossed offspring Q_1, Q_2;
pos_1, pos_2 ← Randomly choose two crossover positions of P_1, P_2;
P_11, P_12 ← Separate genotype of P_1 at position pos_1;
P_21, P_22 ← Separate genotype of P_2 at position pos_2;
Q_1 ← Combine genotypes P_11, P_22;
Q_2 ← Combine genotypes P_21, P_12;
Q_1, Q_2 ← Evaluate and repair dimensions of the new offspring;
Return Q_1, Q_2
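The single-point crossover on parents of different lengths can be sketched as follows. As a simplifying assumption, the repair step only clamps connection IDs at the genotype level; the channel and feature-map repair is assumed to be handled when the offspring is decoded. The helper names are ours.

```python
import random


def repair_connections(gene):
    """Clamp connection IDs so that node j only reads from nodes 0..j-1."""
    return [(b, min(c_m, j - 1), min(c_n, j - 1))
            for j, (b, c_m, c_n) in enumerate(gene, start=1)]


def crossover(p1, p2, rng):
    """Single-point crossover with an independent cut point in each parent.

    Both parents need at least two nodes so that each cut leaves a
    non-empty head and tail.
    """
    pos1 = rng.randrange(1, len(p1))
    pos2 = rng.randrange(1, len(p2))
    q1 = repair_connections(p1[:pos1] + p2[pos2:])  # head of P1, tail of P2
    q2 = repair_connections(p2[:pos2] + p1[pos1:])  # head of P2, tail of P1
    return q1, q2


rng = random.Random(0)
p1 = [("C1", 0, 0), ("C2", 1, 1), ("P1", 2, 2), ("Sum", 1, 3)]
p2 = [("C4", 0, 0), ("C5", 0, 1), ("Concat", 1, 2),
      ("P2", 3, 3), ("C9", 4, 4), ("C3", 2, 5)]
q1, q2 = crossover(p1, p2, rng)
print(len(q1), len(q2))
```

Because the cut points are chosen independently, the offspring lengths generally differ from those of the parents, which is how the depth of the network can change during evolution.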

Fitness evaluation
The evaluation of a solution consists of two conflicting objectives: the classification accuracy (Acc) and the complexity of the corresponding CNN. The classification accuracy is calculated as the ratio of the number of correctly classified images to the total number of images, which is widely adopted in image classification tasks. The complexity is calculated as the number of trainable parameters of the corresponding CNN, as done by Sun et al. [22]. The purpose of choosing these two metrics is to obtain CNN architectures with high accuracy but low complexity.
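The two objectives can be sketched as follows. For brevity, only convolution layers are counted, using the standard count of kernel weights plus one bias per output channel; batch normalization and other layers are ignored, and the `(kernel, in_channels, out_channels)` layer description is a hypothetical format.

```python
def accuracy(predictions, labels):
    """Fraction of correctly classified images."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def conv_params(kernel, in_channels, out_channels, bias=True):
    """Trainable parameters of one kernel x kernel convolution layer."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def objectives(predictions, labels, layer_specs):
    """Return (Acc, number of parameters): maximize the first, minimize the second."""
    n_params = sum(conv_params(k, c_in, c_out) for k, c_in, c_out in layer_specs)
    return accuracy(predictions, labels), n_params


print(objectives([1, 2, 3, 4], [1, 2, 0, 4], [(3, 3, 64), (1, 64, 32)]))
```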

Benchmark datasets
In the experiments, eight benchmark datasets that are widely used for image classification tasks are adopted to evaluate the performance of the proposed LF-MOGP. These benchmark datasets include CIFAR10, CIFAR100, Fashion, MB, MBI, MRB, MRD, and MRDBI. CIFAR10 and CIFAR100 contain color images of objects such as cars and boats. The difference between them is that CIFAR100 covers 100 categories of objects, and each image of CIFAR100 carries a fine label in addition to a super label. Fashion contains fashion objects such as coats and shirts, and its images are grayscale. MNIST and its variants are established for the classification of the ten hand-written digits (i.e., 0-9). The variants MBI, MRB, MRD, and MRDBI introduce different obstacles to MB (e.g., rotation, random noise, background images), which significantly increase the complexity of the classification tasks. Examples from these benchmark datasets are shown in Figs. 7, 8, and 9, respectively. A more detailed description of these datasets is provided in Table 2.

Experimental setting
The LF-MOGP algorithm is implemented in PyTorch, and all the experiments are carried out on a personal computer with one GeForce RTX 3090 GPU, an Intel(R) Xeon(R) Silver 4110 CPU, and 32 GB RAM. The relevant details of the experiments are as follows. The maximum numbers of rows and columns, the level_back, and the minimum and maximum numbers of active nodes allowed in the neural networks are set to 5, 30, 10, 7, and 30, respectively, which were determined by preliminary experiments. Additionally, 22 alternative basic blocks are designed, which are shown in Table 1. The population size of P is set to 30, and the maximum number of generations for population evolution is set to 100. The maximum of the

Overall results
To verify the effectiveness of the proposed LF-MOGP algorithm, a series of comparison experiments with 36 powerful competitor algorithms including the state-of-the-art ones are conducted.
Since re-implementations of the competitor algorithms may not achieve the same performance as reported in the original papers, we collected the experimental results of these algorithms from the original papers for each dataset in our experiment to make a fair comparison. Furthermore, the competitor algorithms were often evaluated on different datasets. Therefore, different competitor algorithms may be chosen for different datasets. More specifically, the best classification results of the proposed LF-MOGP and its competitors are shown in Tables 3, 4, and 5, which correspond to MNIST and its variants, Fashion, and CIFAR, respectively. Note that all the results of the competitors given in the tables are those reported in their papers, where the symbol '−' implies that no publicly recorded result is available for the corresponding competitor.
In Table 5, not only the classification error but also the number of trainable parameters and 'GPU Days' are reported to evaluate both accuracy and complexity. Note that 'GPU Days' is only a reference indicator of computational consumption because the performance of different GPUs differs; the specific experimental environment of each algorithm is given in the last row of Table 5. For example, if an algorithm employs 3 GPUs and runs for 7 days, the corresponding 'GPU Days' is 21. In contrast, the classification errors and training epochs are given in Table 4, while Table 3 only gives the classification errors, because the competitors only reported the best classification errors of their algorithms and did not report other relevant results.

A. MNIST and its variants
As shown in Table 3, regarding the best classification performance, LF-MOGP outperforms all the compared competitors on MNIST and its variants, except on the MRD dataset, where it achieves the third-best performance. More specifically, the best classification errors for MB, MRB, MBI, and MRDBI obtained by the ENAS algorithms are 0.79%, 2.44%, 4.06%, and 17.92%, respectively, whereas our LF-MOGP further reduces them to 0.52%, 2.41%, 3.08%, and 14.98%. For the MB dataset in particular, our LF-MOGP achieves a classification accuracy of nearly 99.5%. Moreover, for the MRDBI dataset, which is the most difficult to classify, the best performance among the compared algorithms is 17.92%, obtained by SEECNN, while our LF-MOGP reduces the classification error to 14.98%, which demonstrates the advantage of LF-MOGP in dealing with complex classification tasks. When compared with the two GP-based algorithms (IEGP and FGP), the proposed LF-MOGP also shows a clear advantage, with significantly lower classification errors than both algorithms on all datasets for which results are available.

B. Fashion
Nine peer competitors, including the four methods (2C1P2F, 2C1P, 3C2F, 3C1P2F+Dropout) collected from the website (https://github.com/zalandoresearch/fashion-mnist) of the Fashion dataset, are adopted here to evaluate the performance of LF-MOGP, and the statistical results are shown in Table 4. As shown in Table 4, LF-MOGP obtains the second-best classification error; among all the competitors, the lowest classification error is 3.09%, obtained by Fine-Tuning DARTS. However, it is worth pointing out that Fine-Tuning DARTS is a handcrafted model for which the cutout and random erasing data augmentation techniques were adopted during training. Besides, Fine-Tuning DARTS is a DARTS-based fine-tuning algorithm, while the proposed LF-MOGP does not use any additional data augmentation techniques and is trained from scratch instead of being fine-tuned. Apart from Fine-Tuning DARTS, the two best competitors are VGG16 and GoogLeNet, and LF-MOGP reduces their classification errors by 2.52% and 2.72%, respectively. Compared to the three ENAS algorithms (EvoCNN, SEECNN, and FPSO), LF-MOGP achieves the highest classification accuracy without significantly increasing the number of parameters. Compared with FPSO, the number of parameters of our model increases by 0.12M, but the accuracy of our model is higher. Besides, LF-MOGP reduces the classification error by 1.68% compared to EvoCNN while reducing the number of parameters by 1.24M, which is very promising, and it reduces the classification error by 1.6% compared to SEECNN.

C. CIFAR10 and CIFAR100
The comparison results of LF-MOGP against the competitors are presented in Table 5. For CIFAR10, the classification error of LF-MOGP is lower than those of all the handcrafted models. Besides, the best CNN evolved by LF-MOGP has a smaller number of parameters, which demonstrates the significant superiority of the proposed algorithm over the handcrafted models. In comparison to the ENAS methods on CIFAR10, the proposed LF-MOGP is not inferior, except that its classification error is 0.72% higher than that of PNAS, while its running time is only 4% of that of PNAS.
Regarding the GPU Days and parameters, the best results among the ENAS algorithms are 1.65 GPU Days and 0.7M parameters, obtained by FPSO. However, its accuracy is less promising, since its classification error is 2.15% higher than that of our LF-MOGP.
Apart from PNAS, the top five ENAS algorithms with the lowest classification errors are CNN-GA, Genetic CNN, Large-scale Evolution, EvoCNN, and NATS-Bench, with classification errors of 4.78%, 5.01%, 5.40%, 5.47%, and 5.63%, respectively. Compared with them, the proposed LF-MOGP further reduces the classification error by 0.65%, 0.88%, 1.27%, 1.34%, and 1.50%, respectively. The CNN evolved by LF-MOGP is more lightweight, with only 1.07M parameters, which is conducive to extending the algorithm to practical applications. Moreover, LF-MOGP also has a significant advantage in terms of 'GPU Days'. Compared with the above ENAS algorithms, the proposed LF-MOGP algorithm takes the shortest running time to search for the optimal CNN architecture, except for FPSO.
For CIFAR100, the best error obtained by LF-MOGP is about 26.37%. Compared with the handcrafted algorithms, LF-MOGP outperforms all of them except DenseNet. More specifically, although the classification error of LF-MOGP is a little higher than that of DenseNet, the number of parameters of the model obtained by our LF-MOGP is only one-third of that of DenseNet. Compared with the ENAS algorithms, the classification error of LF-MOGP is slightly inferior to those obtained by CNN-GA, Large-scale Evolution, and Genetic CNN. However, the number of parameters of our evolved model is much smaller than those obtained by these three algorithms. Moreover, our LF-MOGP reduces the error by 0.77%, 0.74%, and 0.12% compared to the ME-HDSS, MetaQNN, and NATS-Bench algorithms, respectively. Thus, it can be seen that our LF-MOGP is still very competitive with these state-of-the-art algorithms on the CIFAR100 dataset.

Effectiveness regarding the leader-follower mechanism
The leader-follower mechanism (details in Sect. 3.1) is an important strategy proposed in this study. To verify its necessity and effectiveness, further experiments are carried out on CIFAR10 in this section. The comparison of the evolutionary behavior of LF-MOGP with and without the leader-follower mechanism is illustrated in Fig. 10. As can be seen from Fig. 10a, the Pareto front obtained by LF-MOGP is much superior to that obtained by the variant without the leader-follower mechanism. In addition, according to Fig. 10b, the mean ACC of the top 5 individuals obtained by LF-MOGP also outperforms that of the variant without the leader-follower mechanism throughout the evolutionary process.
The experiment results show that the leader-follower mechanism is effective and can guide the search to more promising regions in the search space.
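The Pareto fronts compared above consist of non-dominated solutions, which is also what the external archive (the leader) stores. The following is a minimal sketch of non-dominated filtering for two minimized objectives, classification error and number of parameters. The function names and the numeric values are illustrative assumptions; the actual archive update of LF-MOGP is described in Sect. 3.1 and may differ.

```python
from typing import List, Tuple

# (classification error in %, parameters in millions); both are minimized.
Solution = Tuple[float, float]


def dominates(a: Solution, b: Solution) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive: List[Solution], candidates: List[Solution]) -> List[Solution]:
    """Merge candidates into the archive and keep only non-dominated solutions."""
    merged = list(dict.fromkeys(archive + candidates))  # drop exact duplicates, keep order
    return [s for s in merged if not any(dominates(o, s) for o in merged)]


# Made-up objective values, for illustration only.
archive = update_archive([], [(6.0, 2.0), (5.0, 3.0), (7.0, 2.5)])
archive = update_archive(archive, [(4.5, 3.5), (5.5, 1.5)])
print(archive)
```

In the second call, the new candidate (5.5, 1.5) dominates the archived (6.0, 2.0), so the latter is dropped while the remaining solutions stay mutually non-dominated.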

Evolutionary behavior
To analyze the evolutionary behavior (especially convergence) of the proposed LF-MOGP, we show the evolutionary process of the Pareto front obtained by our algorithm in Fig. 11, using the MBI dataset as an example. The figure contains three parts, namely the evolutionary process of the Pareto front (with a sampling period of 20 generations) and the two convergence trajectories of the ACC and complexity (number of parameters) metrics obtained by our algorithm.
As can be seen from Fig. 11a, the quality of the Pareto front steadily improves as the number of generations increases, with the highest accuracy of 97.6% obtained by the non-dominated solutions on the validation set. According to Fig. 11b, both the mean ACC of the population and the ACC of the best individual increase steadily over the generations and tend to converge at around the 80th generation. Similarly, Fig. 11c shows that the number of parameters of the individual with the best accuracy, as well as the mean number of parameters of the population, decreases markedly as the generations increase, and their trajectories also tend to converge at around the 80th generation.
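One way to make "tends to converge" operational is to look for the first generation after which a metric stays within a small tolerance over a fixed window. The sketch below does this on a synthetic curve; the function name, the window length, the tolerance, and the data are all assumptions for illustration, not the criterion or the measurements used in the paper.

```python
from typing import Optional, Sequence


def convergence_generation(history: Sequence[float], window: int = 10,
                           tol: float = 5e-4) -> Optional[int]:
    """Return the first generation g such that the metric varies by at most
    `tol` over generations g..g+window, or None if no such g exists."""
    for g in range(len(history) - window):
        segment = history[g:g + window + 1]
        if max(segment) - min(segment) <= tol:
            return g
    return None


# Synthetic best-ACC curve: rises linearly for 80 generations, then stays flat.
best_acc = [0.90 + 0.00095 * min(g, 80) for g in range(100)]
print(convergence_generation(best_acc))
```

The same function can be applied to the mean ACC or to the parameter counts, since it only measures how much a trajectory still changes.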

Discussion
Table 5 The best classification error rates, parameters, GPU Days, and experimental environment of LF-MOGP and its competitors on CIFAR10 and CIFAR100
* For the handcrafted models, the number of parameters is the same on CIFAR10 and CIFAR100, whereas the ENAS algorithms obtain different models on different datasets, so two parameter counts are given: the left one for CIFAR10 and the right one for CIFAR100.

In summary, LF-MOGP achieves promising performance on the eight datasets. Compared with the ENAS algorithms, LF-MOGP performed best on five datasets, second best on two datasets, and third best on one dataset. The advantages are more obvious when compared with the Handcrafted algorithms. The main reasons for the good performance of LF-MOGP can be analyzed as follows:

(1) LF-MOGP benefits from its variable length-width encoding strategy. This encoding strategy is conducive to feature extraction and allows richer features to be extracted, and rich features are the basic guarantee for completing classification tasks. In addition, LF-MOGP is essentially a block-based algorithm, while CGP is a tree-like structure that imposes fewer structural constraints on the CNNs, so encoding the blocks with CGP provides a larger search space for the algorithm.
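To illustrate why a CGP-style encoding yields networks of variable depth and width, the sketch below decodes a small genotype by walking back from the output node: only the nodes reachable from the output become part of the network, so the effective size varies even when the genotype length is fixed. The `Node` type, the block names, and the indexing convention are hypothetical and are not the 22 functional blocks or the exact encoding of LF-MOGP.

```python
from typing import List, NamedTuple, Set


class Node(NamedTuple):
    block: str          # name of a functional block (hypothetical names below)
    inputs: List[int]   # indices of earlier nodes; index 0 is the input image


def active_nodes(genotype: List[Node], output: int) -> List[int]:
    """Walk back from the output node and collect the nodes that are actually used."""
    seen: Set[int] = set()
    stack = [output]
    while stack:
        idx = stack.pop()
        if idx == 0 or idx in seen:   # index 0 is the network input, not a gene
            continue
        seen.add(idx)
        stack.extend(genotype[idx - 1].inputs)
    return sorted(seen)


# genotype[i] describes the node with index i + 1.
genotype = [
    Node("conv3x3", [0]),     # index 1
    Node("maxpool", [0]),     # index 2
    Node("conv5x5", [1]),     # index 3 (inactive: nothing downstream uses it)
    Node("concat", [1, 2]),   # index 4
    Node("fc", [4, 1]),       # index 5: the classifier also sees an earlier feature map
]
print(active_nodes(genotype, output=5))  # [1, 2, 4, 5]
```

Node 3 is present in the genotype but absent from the decoded network, and a single mutation of one input index could activate it, which changes depth and width without changing the genotype length.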

Evolved CNN architectures
The best CNNs evolved by LF-MOGP on the seven benchmark datasets are presented in Fig. 12. Based on these best CNNs, the following conclusions can be drawn. First, the depth and width of these CNNs differ across datasets, which indicates that LF-MOGP can design CNNs in a targeted manner according to the data characteristics of different tasks. Second, LF-MOGP breaks the limitations of traditional CNN construction. For example, the fully connected layer can also use features from previous layers, unlike the classical fully connected layer, which completes the classification task based only on the features obtained in the last layer; this improves the utilization of features. In addition, in conventional CNNs pooling usually follows convolution, whereas here the two can be performed simultaneously, which enriches the scales of the features and makes the CNNs constructed by LF-MOGP more flexible and powerful in feature extraction and utilization. Finally, the best CNNs evolved by LF-MOGP on the seven benchmark datasets are relatively lightweight and thus friendlier to resources and hardware. Such CNNs are more suitable for practical applications, for example on mobile devices.

Convergence performance of the best CNNs
To better understand the convergence performance of the best CNNs evolved by LF-MOGP, we plot the trajectories of classification accuracy and loss value on the validation set during the training process. Note that the validation set is used here instead of the test set, because the test set is not allowed to participate in the training process. For brevity, the convergence curves of six representative CNNs with the best performance are provided in Fig. 13. From these curves, the following conclusions can be drawn. First, these CNNs converge within 100 epochs, so the convergence rate is relatively fast. Second, the convergence curves indicate that the relevant hyperparameter settings, such as the number of training epochs, are sufficient for each model to be adequately trained. In addition, the learning rate and its adjustment strategy are set reasonably, enabling the CNNs to converge as soon as possible.

Real-world application
Nowadays, intelligent methods based on computer vision have been widely used in industrial applications [52], so in this section, LF-MOGP is further validated on the online classification of real-world industrial slab numbers. The slab number is used as a unique identifier for each slab in the hot rolling process to help the site operator put the slab into the designated heating furnace for heating and then complete the hot rolling production. Figure 14 shows an example of industrial slab number image data, where the sequence of slab numbers is on the left and the segmented slab number character images are on the right. At present, the identification of slab numbers is done by operators 24 h a day, which is labor-intensive and inefficient, and any mistake can cause serious economic losses to the production line. Therefore, it is important to design a stable and efficient intelligent slab number identification algorithm to reduce labor costs and improve production efficiency. Table 6 shows the classification results of the proposed LF-MOGP and several comparative algorithms on the real-world slab dataset. As can be seen from Table 6, the best CNN architecture evolved by LF-MOGP achieves the highest classification accuracy of 98.07% without a significant increase in the number of parameters compared to the rival algorithms. In fact, its number of parameters is smaller than those of VGG and ResNet. Real-world application scenarios generally have limited computing resources. Therefore, the evolved CNNs, with their high accuracy and low complexity, have high practical application value.
The identification results for some practical slab numbers are presented in Fig. 15, where the characters marked in red denote misidentified characters. From this figure, it can be seen that the CNN evolved by LF-MOGP identifies the slab number correctly in most cases, even for slabs with quite low-quality characters. Such slab numbers are very hard even for experienced human experts to distinguish. It is worth noting that LF-MOGP is designed for single-label image classification. When dealing with slab number sequence recognition, our method first recognizes the segmented character images with a batch size of 10 without disturbing their order, and then converts the recognition results back into a slab number sequence.
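The sequence recognition step described above can be sketched as follows: the segmented character images are classified in batches of 10 in their original order, and the predicted labels are joined into one string. The function names and the stand-in classifier are hypothetical; in practice `classify_batch` would wrap the CNN evolved by LF-MOGP.

```python
from typing import Callable, List, Sequence

BATCH_SIZE = 10  # batch size mentioned in the text


def recognize_slab_number(chars: Sequence[object],
                          classify_batch: Callable[[Sequence[object]], List[str]]) -> str:
    """Classify segmented character images batch by batch, keeping their
    left-to-right order, and join the labels into one slab number string."""
    labels: List[str] = []
    for start in range(0, len(chars), BATCH_SIZE):
        labels.extend(classify_batch(chars[start:start + BATCH_SIZE]))
    return "".join(labels)


# Stand-in classifier for illustration only: each "image" is already its label.
def fake_classifier(batch: Sequence[object]) -> List[str]:
    return [str(c) for c in batch]


result = recognize_slab_number(list("A1234567890B"), fake_classifier)
print(result)
```

Because batches are processed in order and labels are appended in order, the reconstructed string preserves the character order of the original slab number even when the sequence spans more than one batch.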

Conclusion and future work
In this paper, a CGP-based autonomous evolutionary convolutional neural network search algorithm (LF-MOGP) was proposed to evolve good CNNs for image classification tasks. In this algorithm, a flexible variable length-width encoding strategy was designed based on CGP and 22 basic functional blocks, which helps to expand the search space. To accelerate convergence and reduce computational resources, a leader-follower strategy was proposed to guide the evolution process. The proposed LF-MOGP was tested on eight benchmark datasets and a real-world industrial dataset, and the experimental results showed that LF-MOGP outperformed 35 existing algorithms in the literature in terms of classification accuracy, model complexity, and computational resource requirement. Since the CNNs constructed by LF-MOGP are relatively lightweight, they have greater potential for industrial applications, which is our main future work.