RL based hyper-parameters optimization algorithm (ROA) for convolutional neural network

Many real-world applications require optimization in dynamic environments, where the difficulty is to locate and track the optima of a time-dependent objective function. Many evolutionary techniques have been created to solve dynamic optimization problems (DOPs), but more efficient solutions are still required. Recently, a new trend in dynamic optimization has emerged, with reinforcement learning (RL) algorithms expected to breathe fresh life into the DOPs community. In this paper, a new Q-learning RL-based optimization algorithm (ROA) for CNN hyperparameter optimization is proposed. The proposed RL model was tested on two datasets, MNIST and CIFAR-10. The experimental results show that the CNN optimized by ROA achieves higher accuracy than the CNN without optimization. On MNIST, the accuracy of the CNN optimized by ROA after 5 epochs of learning is 98.97%, compared with 97.62% for the CNN without optimization. On CIFAR-10, the accuracy of the CNN optimized by ROA after 10 epochs of learning is 73.40%, compared with 71.73% for the CNN without optimization.


Introduction
People can recognize items in their environment in less than a second, having been taught to recognize things since they were children. Similarly, if computers can detect or categorize objects and environments by scanning for low-level characteristics such as edges and curves, they may use a sequence of convolutional layers to develop more abstract conceptions of what they see. Convolutional Neural Networks (CNNs) are the neural networks used to recognize and classify images. Image recognition is used in a variety of applications (Sehgal et al. 2019), and Deep Neural Networks are responsible for most of the success in this area. While deep networks have enabled numerous valuable applications, several challenges remain (Calabrese et al. 2020). One impediment is the lack of an analytical method for determining the appropriate design of a deep network for a given problem. In many articles, the authors design multiple distinct network topologies before deciding on the one to use, which costs both expertise and effort. Manually selecting and running these architectures can be time-intensive (Duong et al. 2022).
Implementing a CNN requires a set of settings that are independent of the data and that the machine learning researcher must tune manually. Hyperparameters are variables that affect the network structure and how the CNN is trained. Hyperparameter optimization is the task of identifying a suitable hyperparameter configuration, or equivalently the problem of optimizing a loss function over a graph-structured configuration space. The challenge also includes finding a set of hyperparameters that produces an accurate model in an acceptable amount of time. Testing every possible hyperparameter configuration can be computationally costly.

As a result, the demand for an automated and organized search method is growing, and hyperparameter space, in general, is expanding.
Hyperparameter optimization is used to increase the accuracy of neural networks, which have many real-world applications such as signature verification and handwriting analysis, healthcare, the traveling salesman problem, image compression, stock exchange prediction, computer vision, speech recognition, and natural language processing (Abiodun et al. 2018; Thanga et al. 2021). The typical method of hyperparameter optimization has been grid search (Chicco et al. 2017), or parameter sweep, which is an exhaustive search of a manually chosen subset of a learning algorithm's hyperparameter space. A grid search algorithm must be guided by a performance metric, which is commonly measured by cross-validation on the training set or by evaluation on a held-out validation set. Figure 1 illustrates grid search over two hyperparameters. Ten distinct values of each hyperparameter are evaluated, for a total of 100 possible combinations. Blue contours represent regions with strong results, while red contours represent regions with poor results.
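The exhaustive enumeration behind grid search can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function toy_validation_score is a hypothetical stand-in for training a model and scoring it on a validation set, and the hyperparameter names and value ranges are chosen for the example rather than taken from this paper.

```python
import itertools


def toy_validation_score(learning_rate, dropout):
    """Hypothetical stand-in for training a model and scoring it on a validation set."""
    # Peaks at learning_rate = 0.05 and dropout = 0.3; purely illustrative.
    return 1.0 - (learning_rate - 0.05) ** 2 * 100 - (dropout - 0.3) ** 2


def grid_search(grid, score_fn):
    """Evaluate every combination in the grid.

    Returns (best_config, best_score, n_evaluated).
    """
    names = list(grid)
    best_config, best_score, n_evaluated = None, float("-inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(**config)
        n_evaluated += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, n_evaluated


# Ten values per hyperparameter, so 10 x 10 = 100 combinations are evaluated.
grid = {
    "learning_rate": [0.01 * i for i in range(1, 11)],  # 0.01 .. 0.10
    "dropout": [0.1 * i for i in range(10)],            # 0.0 .. 0.9
}
best_config, best_score, n_evaluated = grid_search(grid, toy_validation_score)
print(n_evaluated, best_config)
```

Each additional hyperparameter multiplies the number of evaluations by the number of values tried for it, which is why the cost of this loop grows so quickly.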
Random search (Freitas et al. 2016) is another hyperparameter optimization approach. It replaces the exhaustive enumeration of all combinations with a random selection of them. This applies to the discrete setting described above, but it also extends to continuous and mixed spaces. It can outperform grid search, especially when only a small number of hyperparameters affect the final performance of the machine learning algorithm. In this case, the optimization problem is said to have a low intrinsic dimensionality. Random search is also embarrassingly parallel, and it allows prior knowledge to be included by choosing the distribution from which to sample. Figure 2 illustrates a random search across possible combinations of values for two hyperparameters. In this example, 100 distinct random choices are evaluated. Compared with a grid search, the green bars show that more individual values of each hyperparameter are examined.
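A minimal sketch of random search, under the same caveats, is given below. The scoring function is again a hypothetical stand-in, and the sampling distributions (log-uniform for the learning rate, uniform for dropout) are assumptions that show how prior knowledge can be encoded in the choice of distribution.

```python
import random


def toy_validation_score(learning_rate, dropout):
    """Hypothetical stand-in for training a model and scoring it on a validation set."""
    return 1.0 - (learning_rate - 0.05) ** 2 * 100 - (dropout - 0.3) ** 2


# One sampler per hyperparameter; the distributions are illustrative assumptions.
SAMPLERS = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, -1),  # log-uniform on [0.001, 0.1]
    "dropout": lambda rng: rng.uniform(0.0, 0.9),            # uniform on [0.0, 0.9]
}


def random_search(samplers, score_fn, n_trials, seed=0):
    """Draw n_trials independent configurations and keep the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = {name: sample(rng) for name, sample in samplers.items()}
        trials.append((score_fn(**config), config))
    best_score, best_config = max(trials, key=lambda trial: trial[0])
    return best_config, best_score, trials


best_config, best_score, trials = random_search(SAMPLERS, toy_validation_score, n_trials=100)
distinct_learning_rates = {config["learning_rate"] for _, config in trials}
print(len(trials), len(distinct_learning_rates))
```

Because every trial is drawn independently, the loop can be split across workers or stopped early without invalidating the trials already completed.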
Bayesian optimization (Thornton et al. 2013), on the other hand, is a global optimization strategy for noisy black-box functions. Applied to hyperparameter optimization, it builds a probabilistic model of the function mapping hyperparameter values to the objective evaluated on a validation set. By repeatedly evaluating a promising hyperparameter configuration chosen with the current model and then updating the model, Bayesian optimization seeks to gather observations that reveal as much information as possible about this function and, in particular, about the location of the optimum. Figure 3 illustrates how methods such as Bayesian optimization explore the space of candidate hyperparameter settings by deciding which combination to try next on the basis of previous results.
Rather than tuning these hyperparameters manually, this work presents a novel technique based on reinforcement learning (Minaee et al. 2021) to tune the hyperparameters of a convolutional neural network. Instead of requiring a researcher to adjust hyperparameter knobs by hand in order to converge gradually to an ideal solution, this work automates the process and allows an asynchronous reinforcement learning algorithm to alter the hyperparameters automatically and discover an optimal configuration. When the algorithm has finished executing, the architecture is ready for use. With so many present and future image recognition applications, being able to find appropriate network designs quickly is crucial. In this paper, we introduce a new Q-learning RL-based Optimization Algorithm (ROA) for CNN hyperparameter optimization. RL overcomes the limitations of the traditional evolutionary techniques. The experimental results show that the CNN optimized by ROA achieves higher accuracy than the CNN without optimization.
The rest of the paper is structured as follows: recent related work on CNN hyperparameter optimization techniques is detailed in Sect. 2. The problem definition is described in Sect. 3. The findings and discussion are presented in Sect. 4. Sect. 5 presents our conclusion.

Background
This section discusses the background knowledge of deep learning, advantages of deep learning, Convolutional Neural Networks, and Reinforcement Learning.

Deep learning
Deep learning is a type of machine learning technique that employs multiple layers to extract higher-level features from raw input. In image processing, for example, lower layers may recognize edges, while higher layers may identify concepts meaningful to humans, such as digits, characters, or faces (Tahir et al. 2021; Yuan et al. 2020). The majority of modern deep learning models are based on artificial neural networks, especially convolutional neural networks (CNNs), though they can also contain probabilistic formulas or latent variables structured layer-wise, as in deep generative models like Bayesian networks and deep Boltzmann machines. Each level of deep learning learns to turn the data it receives into a slightly more abstract and composite representation. In an image recognition network, for instance, the second layer may compose and encode arrangements of edges, the third layer may encode a nose and eyes, and the fourth layer may recognize that the image contains a face. Furthermore, a deep learning process can figure out by itself which features belong at which level. This does not eliminate the need for hand-tuning; for example, adjusting the number of layers and the size of the layers can yield different degrees of abstraction (Minaee et al. 2021; Luo et al. 2017a, b).

Advantages of deep learning (DL)
Because of its powerful automatic representation capabilities, deep learning has achieved significant breakthroughs in a variety of domains (Ren et al. 2021). The importance of neural architecture design for data feature representation and final performance has been demonstrated. The neural architecture, however, relies strongly on the researchers' existing knowledge and experience. People find it challenging to break out of their original thinking paradigm and develop an optimal model because of the limits of their own knowledge. As a result, it seems logical to limit human intervention as much as possible and let the algorithm design the neural architecture on its own. The Neural Architecture Search (NAS) algorithm is a game-changing technique, and the research surrounding it is complex and extensive, so a thorough and systematic survey of NAS is required. For example, Yan et al. (2021) turn to NAS and aim to integrate NAS approaches into zero-shot learning (ZSL) for the first time.

Convolutional neural networks (CNN)
The term "convolutional neural network" refers to the network's use of a mathematical operation known as convolution. Convolutional networks are a subset of neural networks that employ convolution instead of general matrix multiplication in at least one layer (Goodfellow et al. 2016; Luo et al. 2017a, b). As illustrated in Fig. 4, a convolutional neural network consists of three kinds of layers: an input layer, hidden layers, and an output layer. Any intermediate layers in a feed-forward neural network are referred to as hidden because the activation function and final convolution mask their inputs and outputs. The convolution layer computes the output of neurons connected to local regions of the input volume, each computing a dot product between its weights and the small region of the input volume to which it is connected (Chang et al. 2015). By performing a down-sampling operation along the spatial dimensions, pooling reduces the number of extracted features. The dense layers are responsible for calculating either the hidden activations or the class scores.
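The two operations described above, the local dot product of a convolution layer and the spatial down-sampling of a pooling layer, can be illustrated with plain Python lists. This is a sketch for intuition only: the function names, the 5 x 5 input, and the 2 x 2 kernel are assumptions made for the example, and, as in most deep learning libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take a dot product at each position (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Down-sample by keeping the maximum of each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(
                max(
                    feature_map[i + di][j + dj]
                    for di in range(size)
                    for dj in range(size)
                )
            )
        pooled.append(row)
    return pooled


# A 5 x 5 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A 2 x 2 kernel that responds to dark-to-bright transitions from left to right.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, kernel)  # 4 x 4 map, non-zero only at the edge
pooled = max_pool2d(feature_map)           # 2 x 2 map after pooling
print(feature_map)
print(pooled)
```

The feature map responds only where the kernel straddles the edge, and pooling keeps that response while reducing the 4 x 4 map to 2 x 2.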

Reinforcement learning (RL)
Reinforcement learning trains an agent to complete a task by allowing the agent to explore and experience the environment while maximizing a reward signal. It differs from supervised machine learning, in which the algorithm learns from a set of samples labeled with the right responses. One advantage of reinforcement learning over supervised machine learning is that the reward signal may be generated without prior knowledge of the proper course of action, which is especially beneficial if such a labeled dataset does not exist or is impractical to gather. While reinforcement learning may appear comparable to unsupervised machine learning at first glance, the two are different. Unsupervised machine learning seeks to discover some (hidden) structure inside a dataset, whereas reinforcement learning does not seek to discover structure in data (Yamauchi et al. 2020). Instead, reinforcement learning seeks to teach an agent how to accomplish a task through rewards and trial and error. There are two types of algorithms in reinforcement learning: value-based algorithms and policy-based algorithms. Value-based algorithms attempt to approximate or discover the value function that assigns a reward value to state-action pairs. These values can then be used to derive a policy. The raw input in an image recognition application could be a matrix of pixels; the first layer may abstract the pixels and detect edges.
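Q-learning, the method on which the proposed algorithm is based, is a standard example of a value-based algorithm. For reference, its textbook tabular update rule is reproduced below. The symbols are the conventional ones and are not defined in this paper: \alpha is the learning rate, \gamma is the discount factor, r_{t+1} is the reward received after taking action a_t in state s_t, and s_{t+1} is the resulting state.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],
\qquad
\pi(s) = \arg\max_{a} Q(s, a)
```

The second expression is the greedy policy obtained from the learned values, which is one way the values of state-action pairs can be turned into a policy.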

Related work
The most current works for CNN hyperparameter optimization, such as grid search, Bayesian optimization, Genetic Algorithm (GA), and random search, are discussed in this section.
The grid search method is a trial-and-error method that evaluates every hyperparameter setting over a defined range of values. The use of a grid search has the advantage of being easily parallelized (Bergstra et al. 2012). Researchers and practitioners specify the boundaries and the steps between values of the hyperparameters, resulting in a grid of configurations. However, if one task fails, the others will follow suit. In most circumstances, a machine learner will start with a small grid and subsequently expand it, which makes it more efficient to settle on the best grid while searching for a new one. Because of the curse of dimensionality, grid search becomes impractical at around four hyperparameters, as the number of function evaluations grows exponentially with each additional parameter.
The random search method samples the hyperparameter space at random. According to Bergstra et al. (2012), random search has several advantages over grid search: an experiment can continue to run even if part of the computer cluster fails, and practitioners can adjust the "resolution" on the fly, add trials to the set, or ignore failed trials entirely. The random search process can also be stopped at any time, with the trials completed so far forming a complete experiment, and the trials can be run in parallel (Bergstra et al. 2011). Furthermore, if more computers become available, new trials can be added to the experiment without endangering the results (Bergstra et al. 2013).
Bayesian optimization is another recent advancement in hyperparameter tuning. It employs a Gaussian process, which is a distribution over functions. To train a Gaussian process, it is fitted to the given data so that it generates functions that closely resemble the data. In Bayesian optimization, the Gaussian process serves as the surrogate model, and the search optimizes the expected improvement, which is the expected amount by which a new trial will improve on the current best observation. The next trial is placed where the expected improvement is largest, which can be computed at any point in the search space. Spearmint, which uses a Gaussian process, is a widely used example of Bayesian optimization (Snoek et al. 2015). Bergstra et al. (2012) contend that Bayesian optimization is limited because it copes poorly with high-dimensional hyperparameter spaces and is computationally expensive. More generally, Bayesian optimization (BO) works by fitting a probabilistic model to the data and then using that model as a cheap proxy to select the next most promising point to evaluate. The Gaussian Process Regressor, Bayesian Neural Network (Springenberg et al. 2016), and Random Forest Regressor are some of the proxy models that have been proposed.
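For a Gaussian process surrogate, the expected improvement mentioned above has a well-known closed form. The expression below is the standard one for a maximization problem and is given for reference only. The notation is an assumption of this note and does not come from the paper: \mu(x) and \sigma(x) are the posterior mean and standard deviation of the surrogate at x, f^{+} is the best value observed so far, and \Phi and \varphi are the standard normal cumulative distribution and density functions.

```latex
\mathrm{EI}(x) = \bigl(\mu(x) - f^{+}\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad
z = \frac{\mu(x) - f^{+}}{\sigma(x)} \quad (\sigma(x) > 0),
\qquad
\mathrm{EI}(x) = 0 \ \text{if } \sigma(x) = 0
```

The first term favors points whose predicted value exceeds the current best, and the second favors points where the surrogate is uncertain, so maximizing the expression balances exploitation and exploration.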
Genetic algorithms use a binary representation of individuals (each individual is a string of bits), making mutation and crossover easier to implement. Evolutionary algorithms, on the other hand, rely on problem-specific data structures and require carefully crafted mutation and crossover operators, which depend strongly on the problem at hand (Chiong et al. 2007). According to Rojas et al. (1996), genetic algorithms can be applied when no gradient information is available at the evaluated points. They can produce good results when there are multiple local minima or maxima. Unlike most other search methods, a genetic algorithm evaluates the function at multiple points concurrently rather than at a single location. Because the function evaluations at all points of a population are independent of one another, they can be carried out on many processors (Muhlenbein et al. 1991), so several approaches to the optimum can be processed in parallel. Xiao et al. (2020) employed a variable-length genetic algorithm to systematically tune the hyperparameters of a CNN to increase performance. Their work includes a detailed comparison of several optimization methods such as random search, large-scale evolution, and traditional genetic algorithms. Liashchynskyi et al. (2019) conducted a rigorous evaluation of optimization strategies such as random search, grid search, and the genetic algorithm. The authors used these algorithms to build a Convolutional Neural Network. The dataset for their research is CIFAR-10 with augmentation and pre-processing procedures. Grid search, according to the authors' experience, is not suitable for large search spaces. When there is a large search space and too many parameters to optimize, the authors recommend using the genetic algorithm. Andonie et al. (2020) suggested a new form of random search for hyperparameter optimization of machine learning algorithms. This improved version of random search generates new values for each hyperparameter with a probability of change. The proposed variant outperforms the traditional random search approach. The authors used it to optimize CNN hyperparameters, and it can be used for any optimization problem in the discrete domain.
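The bit-string genetic algorithm discussed earlier in this section can be sketched as follows. This is a generic textbook variant and not the method of any of the cited works: the 8-bit encoding of two hypothetical integer hyperparameters, the toy fitness function, tournament selection, and elitism are all assumptions made for the example.

```python
import random

N_BITS = 8  # two hypothetical hyperparameters, 4 bits each


def decode(bits):
    """Map a bit string to two integer hyperparameters in [0, 15] (illustrative encoding)."""
    first = int("".join(map(str, bits[:4])), 2)
    second = int("".join(map(str, bits[4:])), 2)
    return first, second


def toy_fitness(bits):
    """Hypothetical stand-in for validation accuracy; highest at the decoded pair (10, 5)."""
    first, second = decode(bits)
    return -((first - 10) ** 2 + (second - 5) ** 2)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability rate."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]


def tournament(population, fitness_fn, rng, k=3):
    """Pick the fittest of k randomly chosen individuals."""
    return max(rng.sample(population, k), key=fitness_fn)


def evolve(fitness_fn, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Run the genetic algorithm; returns the best individual and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(population, key=fitness_fn)
        history.append(fitness_fn(best))
        children = [best]  # elitism: the current best survives unchanged
        while len(children) < pop_size:
            child = crossover(
                tournament(population, fitness_fn, rng),
                tournament(population, fitness_fn, rng),
                rng,
            )
            children.append(mutate(child, mutation_rate, rng))
        population = children
    best = max(population, key=fitness_fn)
    history.append(fitness_fn(best))
    return best, history


best, history = evolve(toy_fitness)
print(decode(best), history[-1])
```

The fitness evaluations inside each generation are independent of one another, which is the property that makes this kind of search easy to distribute across processors.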
Li et al. (2018a, b) present a flexible embedding approach for rank-constrained spectral clustering (SC). To recover the block-diagonal affinity matrix of an ideal graph, an adaptive probabilistic neighborhood learning approach is used. The performance of the suggested algorithm is demonstrated by experimental findings on both synthetic and real-world data sets. Sebe et al. (2018) suggested a new method for assigning affinity weights to data points on a per-data-pair basis. The proposed method is effective in learning the affinity network while also fusing features, resulting in better clustering results. A novel Event-Adaptive Concept Integration algorithm is developed by Xu et al. (2019), which uses different weights to measure the efficiency of semantically related concepts. Yu et al. (2018) employ semantic regression to strengthen the neighborhood relationship between data with similar semantics. To address the data ambiguity problem, Zhang et al. (2019) relax the usual ranking loss and propose a novel deep multi-modal network with a top-k ranking loss.

Problem definition
This section discusses the selected hyperparameters and the used methodology. The proposed algorithm relies on using the Reinforcement Learning (RL) algorithm to optimize the CNN hyperparameters. Firstly, we explain the used CNN hyperparameters in Sect. 4.1. Secondly, we explain the proposed Q-learning RL-based Optimization Algorithm (ROA) in Sect. 4.2 in detail.

Selected CNN hyperparameters
It can be difficult to define model architectures because there are so many design options. The best model architecture for a given problem is generally not known in advance. As a result, the purpose of this article is to investigate a variety of options. Ideally, the machine itself conducts this investigation and automatically selects the best model architecture. Hyperparameters are configuration variables that are external to the model and whose values cannot be estimated from the data. There are two different types of hyperparameters: (i) hyperparameters that determine the network structure, listed in Table 1, and (ii) hyperparameters that determine how the network is trained, listed in Table 2. Each hyperparameter is given a more concise and understandable name (abbreviation). In addition, ranges are denoted using square brackets.

RL based optimization algorithm (ROA)
RL is an AI technique in which an agent performs a task in order to be rewarded. The agent receives the current state of the environment and responds with an appropriate action. The action causes a change in the environment, and the agent is notified of the change through a reward. ROA is used to optimize the hyperparameters of a CNN: it learns to select the optimal values for the selected hyperparameters. The overall steps of ROA are shown in Algorithm 1. The ROA contains an agent called the Optimization Agent (OA), as shown in Fig. 5.

Input:
o The data in the PT

Output:
o The optimized values for each hyperparameter in the PT.

Symbol  Meaning
PT      Parameters Table
CNN     Convolutional Neural Network

Steps:
Update the values in the Q-table
10: // Update the data in PT
11: The OA updates the data in the PT (update the values for each hyperparameter in PT)
12: Next
As shown in Fig. 5, the ROA updates the values of the hyperparameters in the Parameter Table (PT), a table that stores the values of all hyperparameters, both Network Structure Hyperparameters (NSH) and Network Trained Hyperparameters (NTH).
The overall steps of the RL-based Optimization Algorithm (ROA) are as follows: (i) For each hyperparameter, create a Q-table containing two columns (state and action). The state is described as the overall data in the parameter table, and the action is described as the selection of the best
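To make the roles of the parameter table and the per-hyperparameter Q-tables concrete, a minimal sketch is given below. It is not the authors' implementation, and every detail beyond "one Q-table per hyperparameter, with the state taken from the PT and the action being the selection of a value" is an assumption: the two hyperparameters and their candidate values, the epsilon-greedy exploration, the values of alpha, gamma, and epsilon, and the function toy_reward, which is a hypothetical stand-in for the validation accuracy of a CNN trained with the values currently stored in the PT.

```python
import random
from collections import defaultdict

# Hypothetical search space: candidate values for each hyperparameter in the PT.
CANDIDATES = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}


def toy_reward(pt):
    """Hypothetical stand-in for the validation accuracy of a CNN trained with the values in pt."""
    reward = 0.5
    reward += 0.3 if pt["learning_rate"] == 0.01 else 0.0
    reward += 0.2 if pt["batch_size"] == 64 else 0.0
    return reward


def pt_state(pt):
    """The state is the overall content of the parameter table, as a hashable tuple."""
    return tuple(pt[name] for name in CANDIDATES)


def run_roa_sketch(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with one Q-table per hyperparameter (assumed design)."""
    rng = random.Random(seed)
    q_tables = {name: defaultdict(float) for name in CANDIDATES}
    pt = {name: values[0] for name, values in CANDIDATES.items()}
    best_pt, best_reward = dict(pt), toy_reward(pt)
    for _ in range(episodes):
        for name, values in CANDIDATES.items():
            q = q_tables[name]
            state = pt_state(pt)
            # Epsilon-greedy choice of which candidate value to write into the PT.
            if rng.random() < epsilon:
                action = rng.randrange(len(values))
            else:
                action = max(range(len(values)), key=lambda a: q[(state, a)])
            pt[name] = values[action]  # the agent updates the PT
            reward = toy_reward(pt)
            next_state = pt_state(pt)
            best_next = max(q[(next_state, a)] for a in range(len(values)))
            # Standard Q-learning update of the Q-table for this hyperparameter.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            if reward > best_reward:
                best_pt, best_reward = dict(pt), reward
    return best_pt, best_reward, q_tables


best_pt, best_reward, q_tables = run_roa_sketch()
print(best_pt, best_reward)
```

In a real setting the reward would come from training and validating the CNN, which is far more expensive than the toy function used here, so the number of episodes would have to be chosen with that cost in mind.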

Used datasets
We experiment with two datasets in this study. The MNIST dataset is used in the first experiment, and the CIFAR-10 dataset is used in the second. We also measure the training time and testing time for each dataset. For MNIST, the training time is 21.32 s and the testing time is 0.52 s. For CIFAR-10, the training time is 803.20 s and the testing time is 4.01 s. A sample of each dataset is shown in Fig. 6.

Experiment using MNIST dataset
MNIST is a dataset of handwritten digit images that contains 60,000 images for learning and 10,000 for testing. Each image is assigned a label from 0 to 9. In the experiment, optimization was carried out with ROA every 5 epochs, and learning was carried out using the resulting parameters. Each experiment is repeated 30 times to obtain an average value (Table 3). Figure 7 illustrates the accuracy of the CNN without optimization vs. the accuracy of the CNN optimized by ROA using the MNIST dataset. Figure 8 illustrates the loss of the CNN without optimization vs. the loss of the CNN optimized by ROA using the MNIST dataset.
Table 3 and Fig. 7 show that the accuracy of the CNN optimized by ROA after 5 epochs of learning is 98.97%, compared with 97.62% for the CNN without optimization. At every epoch examined, the CNN optimized by ROA has higher accuracy than the CNN without optimization.

Experiment using CIFAR-10 dataset
The CIFAR-10 dataset consists of 50,000 learning images and 10,000 testing images. It has 10 class labels and is frequently used as a benchmark for object recognition. In the experiment, optimization is done with ROA every 10 epochs, and learning is done with the resulting parameters. Each experiment is repeated 30 times to obtain an average value. Table 4 shows the accuracy and loss. Figure 9 illustrates the accuracy of the CNN without optimization vs. the accuracy of the CNN optimized by ROA using the CIFAR-10 dataset. Figure 10 illustrates the loss of the CNN without optimization vs. the loss of the CNN optimized by ROA using the CIFAR-10 dataset.
Table 4 and Fig. 9 show that the accuracy of the CNN optimized by ROA after 10 epochs of learning is 73.40%, compared with 71.73% for the CNN without optimization. At every epoch examined, the CNN optimized by ROA has higher accuracy than the CNN without optimization.

Conclusions
In this paper, a new Q-learning RL-based Optimization Algorithm (ROA) for CNN hyperparameter optimization was proposed. The proposed RL model was tested on two datasets, MNIST and CIFAR-10. RL overcomes the limitations of the traditional evolutionary techniques, and RL algorithms are expected to breathe fresh life into the DOPs community. The experimental results show that the CNN optimized by ROA achieves higher accuracy than the CNN without optimization. On MNIST, the accuracy of the CNN optimized by ROA after 5 epochs of learning is 98.97%, compared with 97.62% for the CNN without optimization. On CIFAR-10, the accuracy of the CNN optimized by ROA after 10 epochs of learning is 73.40%, compared with 71.73% for the CNN without optimization.
Fig. 9 Accuracy without optimization vs. accuracy with ROA using CIFAR-10 dataset
Fig. 10 Loss without optimization vs. loss with ROA using CIFAR-10 dataset
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.