
Journal of Membrane Computing

Volume 1, Issue 4, pp 279–291

Hyperparameter optimization in learning systems

Răzvan Andonie
Survey Paper

Abstract

While the training parameters of machine learning models are adapted during the training phase, the values of the hyperparameters (or meta-parameters) have to be specified before the learning phase. The goal is to find a set of hyperparameter values which gives us the best model for our data in a reasonable amount of time. We present an integrated view of methods used in hyperparameter optimization of learning systems, with an emphasis on computational complexity aspects. Our thesis is that we should solve a hyperparameter optimization problem using a combination of techniques for: optimization, search space and training time reduction. Case studies from real-world applications illustrate the practical aspects. We create the framework for a future separation between parameters and hyperparameters in adaptive P systems.

Keywords

Hyperparameters · Membrane computing · Spiking neural P system · P system · Neural computing · Machine learning

1 Introduction

The main building element within P systems is the membrane. A membrane is a discrete unit which can contain a set of objects (symbols/catalysts), a set of rules, and a set of other membranes contained within [37, 36]. Essentially, a P system is a membrane structure. To what extent can such a structure adapt to a given problem? Can it be trained in a way similar to how we train machine learning models or, more specifically, neural networks?

There is a recent tendency to create bridges between adaptive P systems and machine learning paradigms, motivated by the huge success of deep learning in current technologies. Some recent P systems are adaptive [3, 13, 55]. During the Brainstorming Week on Membrane Computing, Sevilla, 2018, ideas about evolving spiking neural (SN) P systems (introduced in [27]) were discussed. A first attempt to use SN P systems in pattern recognition can be found in [49], where SN P systems were reported to outperform back propagation and probabilistic neural networks.

There are things we have to clarify. For instance, we should distinguish between trainable parameters and hyperparameters of a model. In the above-cited references, the plasticity of P systems refers to the parameters only. For example, in [55], a spiking neural P system with self-organization has no initially designed synapses. The synapses can be created or deleted according to the information contained in involved neurons during the computation. In this case, the synapses are parameters, but not hyperparameters.

Many variants of SN P systems have been proposed [44]: with anti-spikes, with weights, with thresholds, with rules on synapses, with structural plasticity, with learning functions, with polarization, with white hole neurons, with astrocytes, etc. For a given problem, which of these models is the most efficient? In this case, the type of the spiking neuron used is a hyperparameter which has to be instantiated by optimization before the P system evolves and adjusts its parameters. In supervised machine learning, in contrast to P systems, the difference between parameters and hyperparameters of models has been intensively studied.

The motivation of our paper is to create the framework for a future separation between parameters and hyperparameters in adaptive P systems. We present an integrated view of methods used in hyperparameter optimization of learning systems in general, with an emphasis on computational complexity aspects. Case studies from real-world applications will illustrate the practical aspects.

As a first step, we have to define more precisely what we understand by “hyperparameter” in the context of machine learning models. Nearly all algorithms used in machine learning have a set of tuning hyperparameters which affect how the learning algorithm fits the model to the data. These hyperparameters should be distinguished from internal model parameters, such as the weights or the rule probabilities, which are adaptively adjusted to solve a problem. For instance, the hyperparameters of neural networks typically specify the architecture of the network (number and type of layers, number and type of nodes, etc.).

The model (the framework) itself may be considered, at a meta-level, a hyperparameter. In this case, we have a list of possible models, each with its own list of hyperparameters, and all have to be optimized for a given problem. For simplicity, we will discuss here only the optimization of hyperparameters within the framework of a given model. Many optimizers attempt to optimize both the choice of the model and the hyperparameters of the model (e.g., Auto-WEKA, Hyperopt-sklearn, AutoML, auto-sklearn, etc.).

The aim of hyperparameter optimization is to find the hyperparameters of a given model that return the best performance as measured on a validation set. This process can be represented in equation form as:
$$x^{*} = \arg \min _{x \in \aleph } f(x),$$
(1)
where f(x) is an objective function to minimize (such as RMSE or error rate) evaluated on the validation set; \(x^*\) is the set of hyperparameters that yields the lowest value of the score, and x can take on any value in the domain \(\aleph \). In simple terms, we want to find the model hyperparameters that yield the best score on the validation set metric.
In more detail, Eq. (1) can be written as:
$$x^{*} = \arg \min _{x \in \aleph } f(x,\gamma ^{*} ;S_{\text{validation}}).$$
(2)
Equation (2) includes an inner optimization used to find \(\gamma ^*\), the optimal value of \(\gamma \), for the current x value:
$$\gamma ^{*} = \arg \min _{\gamma \in \varGamma } f(x,\gamma ;S_{\text{train}}).$$
(3)
\(S_{\text{validation}}\) and \(S_{\text{train}}\) denote the validation and training datasets, respectively; \(\gamma \) is the set of learned parameters in the domain \(\varGamma \), obtained through minimization of the training error.
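As an illustration of Eqs. (1)–(3), the following minimal sketch (not from the original paper; scikit-learn, the dataset, and the candidate set are assumptions chosen for brevity) treats the model's own training as the inner optimization of Eq. (3) and compares trials in the outer loop of Eq. (1):

```python
# Minimal sketch of Eqs. (1)-(3): the inner optimization (Eq. 3) is the model's
# own training; the outer loop (Eq. 1) compares validation errors across trials.
# scikit-learn, the digits dataset and the candidate list are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def trial(x):
    """One trial: train with hyperparameters x on S_train, score on S_validation."""
    model = SVC(C=x["C"], gamma=x["gamma"]).fit(X_train, y_train)   # Eq. (3)
    return 1.0 - model.score(X_val, y_val)                          # f(x, gamma*; S_validation)

candidates = [{"C": c, "gamma": g} for c in (0.1, 1, 10) for g in (1e-3, 1e-2)]
x_star = min(candidates, key=trial)                                 # Eq. (1)
```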

The validation process is more complex than described here. Each combination of hyperparameter values may result in a different model and Eq. (1) evaluates and compares the objective function for different models. The process of finding the best-performing model from a set of models that were produced by different hyperparameter settings is called model selection [41].

For simplicity, in Eq. (1), we referred to a “validation set”, without further details. We have to evaluate the expectation of the score over an unknown distribution, and we usually approximate this expectation using a three-way split, dividing the dataset into a training, validation, and test dataset. Having a training–validation pair for hyperparameter tuning and model selection allows us to keep the test set “independent” for model evaluation.
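A common way to obtain such a three-way split is sketched below (the 60/20/20 ratios, the placeholder data, and the use of scikit-learn are illustrative assumptions, not prescriptions from the paper):

```python
# Three-way split sketch: first hold out the test set, then split the remainder
# into training and validation sets (ratios are illustrative).
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, size=1000)   # placeholder data
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)           # 0.25 * 0.8 = 20% overall
```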

The recent interest in hyperparameter optimization is related to the importance and complexity of the deep learning architectures. The existing deep learning hyperparameter optimization algorithms are computationally demanding. For example, obtaining an optimized architecture for the CIFAR-10 and ImageNet datasets required 2000 GPU days of reinforcement learning [58] or 3150 GPU days of evolution [42].

There are two inherent causes of this inefficiency. The first is related to the search space, which can be a discrete domain; in its most general form, discrete optimization is NP-complete.

The second cause is that evaluating the objective function to find the score is expensive: Each time we try different hyperparameters, we have to train a model on the training data, make predictions on the validation data, and then calculate the validation metric. This optimization is usually done by re-training multiple models with different combinations of hyperparameter values and evaluating their performance. We call this re-training + evaluation for one set of hyperparameter values a trial.

Basically, there are three computational complexity aspects which have to be addressed: (a) choose an efficient optimization method for Eq. (1); (b) reduce the search space; and (c) reduce the training time for each trial.

Whereas choosing a good optimization method is a well-defined (but difficult) task, reducing the search space and the training time is equally important. This can be done, for instance, by reducing the number of hyperparameters and features, reducing the training set, and using additional objective functions.

The number of trials generally increases exponentially with the number of hyperparameters. Therefore, we would like to reduce the number of hyperparameters. For instance, hyperparameters can be ranked (and selected) based on the functional analysis of the variance of the objective function. This method is known as sensitivity analysis.

Reducing the number of trials is not enough if the training time for each trial is high. In this case, we should attempt to reduce the training set (instance selection).

Using additional objective functions can also reduce the search space for the optimal hyperparameter configuration. For instance, we may add constraints on the complexity of a neural network (such as the number of connections).

We are ready now to describe the structure of the paper. Section 2 is an overview of some standard methods for hyperparameter optimization. Section 3 presents how the search space and the training time can be reduced by instance selection, hyperparameter/feature ranking and using additional objective functions. Section 4 lists the most recent software packages for hyperparameter optimization. Section 5 presents case studies based on three well-known machine learning models. Section 6 contains final remarks and open problems.

2 Methods for hyperparameter optimization

Hyperparameter optimization should be regarded as a formal outer loop in the learning process. In the most general case, such an optimization should include a budgeting choice of how many CPU cycles are to be spent on hyperparameter exploration, and how many CPU cycles are to be spent evaluating each hyperparameter choice (i.e. by tuning the regular parameters) [9].

A simple strategy for hyperparameter optimization is a greedy approach: investigate the local neighborhood of a given hyperparameter configuration by varying one hyperparameter at a time and measuring how performance changes. The only information obtained with this analysis is how different hyperparameter values perform in the context of a single instantiation of the other hyperparameters. We cannot expect good results with this approach.

Fortunately, we do have more systematic approaches. We will review the most fundamental methods. Details can be found, for instance, in [9].

2.1 Grid search

The most commonly used hyperparameter optimization strategy is a combination of Grid search (GS) and manual tuning. There are several reasons why manual search and grid search prevail as the state of the art despite decades of research into global optimization:
  • Manual optimization gives researchers some degree of insight into the optimization process.

  • There is no technical overhead or barrier to manual optimization.

  • Grid search is simple to implement and parallelization is trivial.

  • Grid search (with access to a compute cluster) typically finds a better solution than purely manual sequential optimization (in the same amount of time).

  • Grid search is reliable in low-dimensional spaces (e.g., 1-d, 2-d). For instance, grid search is relatively efficient for optimizing the two standard hyperparameters of SVMs (LibSVM does this).

GS suffers from the curse of dimensionality because the number of joint values grows exponentially with the number of hyperparameters. Therefore, GS is not recommended for the optimization of many hyperparameters.
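The exponential growth of the grid is easy to see in code. The following sketch (the grids, model, and dataset are illustrative assumptions, not from the paper) evaluates every joint combination of two SVM hyperparameters:

```python
# Bare-bones grid search over two SVM hyperparameters: the number of trials is
# the product of the grid sizes, which grows exponentially with the number of
# hyperparameters.
import itertools
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
C_grid = [0.1, 1, 10, 100]
gamma_grid = [1e-4, 1e-3, 1e-2, 1e-1]

best_score, best_cfg = -1.0, None
for C, gamma in itertools.product(C_grid, gamma_grid):      # |C_grid| x |gamma_grid| trials
    score = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()
    if score > best_score:
        best_score, best_cfg = score, (C, gamma)
```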

2.2 Random search

Random search (RS) consists of drawing samples from the parameter space following a particular distribution for each of the parameters. Using the same number of trials, RS generally yields better results than GS or more complicated hyperparameter optimization methods. Especially in higher-dimensional spaces, the computation resources required by RS methods are significantly lower than for GS [31]. RS works best under the assumption that not all hyperparameters are equally important [11].
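A minimal RS sketch follows, assuming log-uniform distributions for two SVM-like hyperparameters (the distributions and the toy objective are illustrative, not taken from the paper):

```python
# Random search sketch: each hyperparameter is drawn independently from its own
# distribution; the best of n_trials independent configurations is returned.
import random

def sample_configuration():
    return {"C": 10 ** random.uniform(-2, 2),       # log-uniform cost
            "gamma": 10 ** random.uniform(-4, 0)}   # log-uniform kernel width

def random_search(objective, n_trials=50):
    trials = [sample_configuration() for _ in range(n_trials)]
    return min(trials, key=objective)               # lowest objective value wins

# Toy usage with a synthetic objective (stands in for a validation error).
best = random_search(lambda cfg: (cfg["C"] - 1.0) ** 2 + cfg["gamma"])
```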

Other advantages of RS are:
  • The experiment can be stopped any time and the trials form a complete experiment. The key is to define a good stopping criterion, representing a trade-off between accuracy and computation time.

  • New trials can be added to an experiment without having to adjust the grid and commit to a much larger experiment.

  • Every trial can be carried out asynchronously. Therefore, RS methods are relatively easy to implement on parallel computer architectures.

  • If the computer carrying out a trial fails for any reason, its trial can be either abandoned or restarted without jeopardizing the optimization.

Recent attempts to optimize the RS algorithm are: Li et al.'s Hyperband [32], which speeds up RS through adaptive resource allocation and early stopping; Domhan et al. [17], who developed a probabilistic model to mimic early termination of sub-optimal candidates; and Florea et al. [20], where we introduced a dynamically computed stopping criterion for RS, reducing the number of trials without reducing the generalization performance.

To illustrate the efficiency of RS in high-dimensional spaces, we refer to the following real-world application. Using RS, we have introduced in [31] the first polynomial (in the size of the input and the number of dimensions) algorithm for finding maximal empty hyper-rectangles (holes) in data. All previous (deterministic) algorithms are exponential.

We used 5522 protein structures randomly selected from the Protein Databank, a repository of the atomic coordinates of over 100,000 protein structures that have been solved using experimental methods [56]. Proteins are three-dimensional dynamic structures that mediate virtually all cellular biological events. From the hyper-rectangles generated by our algorithm, we were able to determine which of the 39 dimensions in our data were most frequently the bounding conditions of the largest found hyper-rectangles.

Our algorithm only needs to examine a small fraction of the theoretical maximum of \(6.007576 \times 10^{104}\) possible hyper-rectangles. In a second stage, we were able to extract if/then rules from the hyper-rectangle output and found several interesting relationships among the 39-dimensional data.

2.3 Derivative-free optimization: Nelder–Mead

In hyperparameter optimization, we usually encounter the following challenges: non-differentiable functions, multiple objectives, large dimensionality, mixed variables (discrete, continuous, permutation), multiple local minima (multi-modal), discrete search space, etc. Derivative-free optimization refers to the solution of optimization problems using algorithms that do not require derivative information, but only objective function values.

Unlike derivative-based methods in a convex search space, derivative-free methods are not necessarily guaranteed to find the global optimum. Examples of derivative-free optimization methods are: Nelder–Mead (NMA), Simulated Annealing, Evolutionary Algorithms, and Particle Swarm Optimization (PSO).

The NMA was introduced [34] as early as 1965 and performs a search in n-dimensional space using heuristic ideas. The method uses the concept of a simplex, a structure in n-dimensional space formed by \(n+1\) points that are not coplanar.

NMA maintains a set of \(n+1\) test points arranged as a simplex and modifies the simplex at each iteration using four operations: reflection, expansion, contraction, and shrinking. Each of these operations generates a new point. The sequence of operations performed in one iteration depends on the value of the objective function at the new point relative to the other points: the method extrapolates the behavior of the objective function measured at the test points to find a new test point, which replaces one of the old test points. If the new point is better than the best current point, the search tries to move further along this line. If the new point is not much better than the previous value, the simplex shrinks towards a better point. Despite its age, the NMA search technique is still very popular.
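Since NMA only needs objective values, it can be applied directly to a cross-validated error. A small sketch using SciPy's Nelder–Mead implementation (the log-space parametrization, dataset, and iteration budget are illustrative assumptions):

```python
# Nelder-Mead as a derivative-free hyperparameter optimizer: the simplex moves
# in (log10(C), log10(gamma)) space; only objective values are required.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(z):                                   # z = [log10(C), log10(gamma)]
    C, gamma = 10.0 ** z[0], 10.0 ** z[1]
    return 1.0 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

result = minimize(objective, x0=np.array([0.0, -3.0]),
                  method="Nelder-Mead", options={"maxiter": 30})
```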

The NMA was used in Convolutional Neural Network (CNN) optimization in [1, 2], in conjunction with a relatively small optimization dataset. It works well for objective functions that are smooth, unimodal and not too noisy. The weakness of this method is that it is not very good for problems with more than about 10 variables; above this number of variables, convergence becomes increasingly difficult.

2.4 Bayesian optimization

RS and GS pay no attention to past results and keep searching across the entire search space. However, it may happen that the optimal answer lies within a small region. In contrast, Bayesian optimization iteratively computes a posterior distribution of functions that best describes the objective function. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not. By evaluating hyperparameters that appear more promising from past results, Bayesian methods can find better model settings than RS or GS in fewer iterations. A review of Bayesian optimization can be found in [45].

Bayesian optimization keeps track of past evaluation results, which it uses to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function:
$$P(\text{score} \mid \text{hyperparameters})$$
This model is called a surrogate for the objective function. The surrogate probability model is iteratively updated after each evaluation of the objective function.
Following Will Koehrsen, we have the following steps:
  1. Build a surrogate probability model of the objective function.
  2. Find the hyperparameters that perform best on the surrogate.
  3. Apply these hyperparameters to the true objective function.
  4. Update the surrogate model incorporating the new results.
  5. Repeat steps 2–4 until max iterations or time is reached.
Sequential model-based optimization (SMBO) methods are a formalization of Bayesian optimization. “Sequential” refers to running trials one after another, each time trying better hyperparameters and updating a probability model (surrogate).

There are several variants of SMBO methods that differ in steps 3 and 4, namely, how they build a surrogate of the objective function and the criteria used to select the next hyperparameters: Gaussian Processes, Random Forest Regressions, Tree Parzen Estimators, etc.
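The steps listed above can be condensed into a short SMBO loop. The sketch below uses a Gaussian-process surrogate and a lower-confidence-bound acquisition; this is an illustrative choice, not the implementation of any particular package, and a production system would typically use expected improvement and avoid re-evaluating already observed points:

```python
# Minimal SMBO loop with a Gaussian-process surrogate over a discrete candidate set.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def smbo(objective, candidates, n_init=5, n_iter=20, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    init = rng.choice(len(candidates), size=n_init, replace=False)
    observed_x = [candidates[i] for i in init]                # initial random design
    observed_y = [objective(x) for x in observed_x]
    for _ in range(n_iter):
        gp = GaussianProcessRegressor().fit(np.array(observed_x), np.array(observed_y))
        mu, sigma = gp.predict(np.array(candidates), return_std=True)
        nxt = candidates[int(np.argmin(mu - kappa * sigma))]  # step 2: best point on surrogate
        observed_x.append(nxt)                                # step 3: evaluate true objective
        observed_y.append(objective(nxt))                     # step 4: surrogate refit next loop
    best = int(np.argmin(observed_y))
    return observed_x[best], observed_y[best]

# Toy usage: minimize a 1-D quadratic over 61 candidate values.
cands = [[v] for v in np.linspace(-3.0, 3.0, 61)]
x_best, y_best = smbo(lambda x: (x[0] - 1.0) ** 2, cands)
```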

In low-dimensional problems with numerical hyperparameters, the best available hyperparameter optimization methods use Bayesian optimization [48]. However, Bayesian optimization is restricted to problems of moderate dimension [45].

3 Computational complexity issues

There are two complexity issues related to the search process in hyperparameter optimization:
  • A1. The execution time of each trial. The training phase for each trial depends on the size and the dimensionality of the training set.

  • A2. The complexity of the search space itself and the number of evaluated combinations of hyperparameters.

In the case of deep learning, we have both aspects: a high-dimensional search space, which, according to the curse of dimensionality, has to be covered with an exponentially increasing number of points (in this case, combinations of hyperparameters); and large training sets, which are typically used in deep learning.
To address these issues and reduce the search space, some standard techniques may be used:
  • Reduce the training dataset based on statistical sampling (relates to A1).

  • Feature selection (relates to A1).

  • Hyperparameter selection: detect which hyperparameters are more important for the neural network optimization. This may reduce the search space (relates to A2).

  • Besides accuracy, use additional objective functions (number of operations, optimization time, etc.). This may also reduce the search space (relates to A2). For instance, superior results were obtained by combining accuracy with visualization via a deconvolution network [1].

We will describe in the following how these techniques can be used.

3.1 Reduce the training dataset

Generally, training models with enough information is essential to achieve good performance. However, it is common that a training dataset T contains samples that may be similar to each other (that is, redundant) or noisy. This increases computation time and can be detrimental to generalization performance.

Instance selection aims to select a subset \(S \subset T\), hoping that it can represent the whole training set and achieve acceptable performance. The techniques for instance selection are very similar to the ones in feature selection. For example, instance selection methods can either start with \(S = \emptyset \) (incremental method) or \(S = T\) (decremental method). As a result, this reduces training time. A review of instance selection methods can be found in [35].

Like in feature selection, according to the strategy used for selecting instances, we can divide the instance selection methods into two groups [35]:
  1. Wrapper. The selection criterion is based on the accuracy obtained by a classifier (commonly, instances that do not contribute to the classification accuracy are discarded from the training set).
  2. Filter. The selection criterion uses a selection function which is not based on a classifier.
For hyperparameter optimization, since we evaluate different ML models, we are only interested in filter instance selection.
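As a trivial example of the filter family, the following sketch reduces the training set by stratified random subsampling, which consults no classifier at all (the sampling fraction and data are illustrative; the methods cited in [35, 47] are considerably more refined):

```python
# Filter-style instance selection by stratified random subsampling: keep a fixed
# fraction of each class, without using any classifier in the selection.
import numpy as np

def stratified_subsample(X, y, fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        n_keep = max(1, int(round(fraction * len(idx))))
        keep.extend(rng.choice(idx, size=n_keep, replace=False))
    keep = np.array(keep)
    return X[keep], y[keep]

# Usage: hyperparameter trials would then train on the reduced set (X_small, y_small).
X = np.random.rand(1000, 10)
y = np.random.randint(0, 3, size=1000)
X_small, y_small = stratified_subsample(X, y, fraction=0.1)
```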

Several instance selection techniques have been introduced. For example, a very simple technique is to select one training sample at a time such that, when added to the previous set of examples, it results in the largest decrease in a squared error estimate criterion [39]. In [18], a measure of a sample’s influence on the classifier output is used to reduce the training set. Stochastic sampling algorithms were introduced in [47]. The progressive sampling method presented in [40] is an incremental method using progressively larger samples as long as model accuracy improves. Instance selection algorithms for CNN architectures were proposed in [1, 2, 28].

A related approach to instance selection is active learning which sequentially identifies critical samples to train on. For example, Bengio et al. suggest that guiding a classifier by presenting training samples in an order of increasing difficulty can improve learning [8]. They use the following observation: “Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones”. There are also other ways to order the training patterns (for instance, [16]). According to [8], these ordering strategies can be seen as a special form of transfer learning where the initial tasks are used to guide the learner so that it will perform better on the final task.

At the extreme, we may consider the order of the training sequence itself as a hyperparameter of the model. However, finding the optimal permutation based on some performance criterion is computationally not feasible, since the size of the search space is in the order of the number of possible permutations.

3.2 Reduce the number of features

Feature selection is a standard technique in machine learning [24]. By reducing the number of features we reduce training time. Feature selection is very similar to hyperparameter selection and similar techniques can be used to assess the importance of hyperparameters and features. Basically, we search for an optimal configuration of hyperparameters (or features), using trials to assess each configuration.

The search for the optimal joint combination of features and hyperparameters is computationally hard. A sequential greedy search in two stages (feature selection followed by hyperparameter selection) is simpler. This ignores the fact that the selection of features and the selection of hyperparameters are not independent processes, but the approximation may be acceptable.

The problem of performing feature selection inside or outside the cross-validation loop (the loop which performs the fine tuning of the parameters) was investigated in [43], with very nuanced results. Ultimately, it depends on the dataset and the problem to be solved.

Some software packages (see Sect. 4) perform feature selection and hyperparameter optimization together. There are also some published results in which features and hyperparameters are optimized together for specific models, for example the SVM [51].

3.3 Hyperparameter selection by functional analysis of variance

If we can assess the importance of each hyperparameter, then we can also select and optimize only the most important hyperparameters of a model. This would reduce the complexity of hyperparameter optimization.

Optimally, we would like to know how all hyperparameters affect performance across all their instantiations. In most cases, such a general optimization is computationally prohibitive. A greedy approach to assess the importance of a hyperparameter is to vary one hyperparameter at a time and measure how this affects the performance of the objective function. The only information obtained with this analysis is how different hyperparameter values perform in the context of a single instantiation of the other hyperparameters.

A more elaborate analysis based on predictive models can be used to quantify the performance of a hyperparameter instantiation in the context of all instantiations of the other hyperparameters. For instance, we can use sensitivity analysis (SA) [29], described as follows. Once a network has been trained, calculate an average value for each hyperparameter. Then, holding all variables but one at their average levels, vary the remaining input over its entire range and compute the variability produced in the net outputs. Analysis of this variability may be done for several different networks, each trained from a different weight initialization. The algorithm will then rank the hyperparameters from highest to lowest according to the mean variability produced in the output.
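A compact sketch of this one-at-a-time sensitivity ranking follows; the output-range measure is one of the measures discussed in [29], while the objective function and value ranges below are illustrative assumptions:

```python
# Sensitivity analysis sketch: hold all hyperparameters at their average value,
# vary one over its range, and rank by the range of the produced outputs.
import numpy as np

def sensitivity_ranking(objective, ranges, n_points=10):
    """ranges: dict name -> (low, high); returns names sorted by decreasing output range."""
    averages = {k: (lo + hi) / 2.0 for k, (lo, hi) in ranges.items()}
    variability = {}
    for name, (lo, hi) in ranges.items():
        scores = []
        for v in np.linspace(lo, hi, n_points):
            cfg = dict(averages)
            cfg[name] = v                              # vary only this hyperparameter
            scores.append(objective(cfg))
        variability[name] = max(scores) - min(scores)  # output-range sensitivity measure
    return sorted(variability, key=variability.get, reverse=True)

# Toy usage: 'C' dominates the synthetic objective, so it is ranked first.
rank = sensitivity_ranking(lambda c: 5.0 * c["C"] + 0.1 * c["gamma"],
                           {"C": (0.0, 1.0), "gamma": (0.0, 1.0)})
```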

Three different sensitivity measures were proposed in [29], based on: output range, variance, and average gradient over all the intervals. Although the use of average values by all but one input does not capture all of the complex interactions in the input space, it does produce a rough estimate of the model’s univariate sensitivity to each input variable. Obviously, we do not consider the interactions between hyperparameters.

SA measures the effects on the output of a given model when the inputs are varied through their range of values. It allows a ranking of the inputs that is based on the amount of output changes that are produced due to variations in a given input. SA can be used both for hyperparameter and feature importance assessment.

A more sophisticated approach is based on the analysis of variance (functional ANOVA) [26]. It is presented as a software package (fANOVA) able to approximate the importance of the hyperparameters of a model.

Recently introduced, the N-RReliefF algorithm [50] can also estimate the contribution of each single hyperparameter to the performance of a model. N-RReliefF was used to determine the importance of the interactions between hyperparameters on 100 data sets. The results showed that the same hyperparameters have similar importance on different data sets. This does not mean that only adjusting the most important hyperparameters and combinations is the best option in all cases. When there are enough computing resources, it is still recommended to optimize all hyperparameters [26]. However, for computationally intensive optimizations, hyperparameter selection is a reasonable option.

3.4 Use additional objective functions

The performance of the objective function f (in Eq. 1) is usually measured in terms of the accuracy of the model. Besides accuracy, we may use additional objective functions. The search could also be guided by goals like the training time or the memory requirements of the network.

One possibility is to add an objective function measuring the complexity of the model. Smaller complexity also means a smaller number of hyperparameters, and we prefer models with small complexities. Models with lower complexity are not only faster to train; they also require smaller training sets (the curse of dimensionality) and have a smaller chance to over-fit (over-fitting increases with the number of hyperparameters).

For a CNN, the complexity is the aggregation of the following hyperparameters' values: number of layers, number of maps, number of fully connected layers, and number of neurons in each layer. In the case of ConvNet architectures implemented on mobile devices, besides accuracy, we may also consider the characteristics of the hardware implementation. Such metrics include latency and energy consumption [57].
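A sketch of such a combined objective is given below; the penalty weights and the particular complexity and latency measures are illustrative assumptions, not values from the paper:

```python
# Accuracy combined with complexity and latency penalties; higher is better.
def penalized_objective(accuracy, num_parameters, latency_ms,
                        w_complexity=1e-7, w_latency=1e-3):
    return accuracy - w_complexity * num_parameters - w_latency * latency_ms

# Example: a slightly less accurate but much smaller and faster model wins.
big = penalized_objective(accuracy=0.91, num_parameters=5_000_000, latency_ms=120)   # 0.29
small = penalized_objective(accuracy=0.90, num_parameters=500_000, latency_ms=15)    # 0.835
```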

4 Software for hyperparameter optimization

Several software libraries dedicated to hyperparameter optimization exist. Each optimization technique included in a package is called a solver. The choice of a particular solver depends on the problem to be optimized.

LIBSVM [14] and scikit-learn [38] come with their own implementations of GS, with scikit-learn also offering support for RS.
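For illustration, scikit-learn's two interfaces can be used as follows (the parameter grid and distributions are arbitrary examples):

```python
# scikit-learn's grid search and random search interfaces, applied to an SVM.
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

gs = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2]}, cv=3).fit(X, y)
rs = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1.0)},
                        n_iter=20, cv=3, random_state=0).fit(X, y)
print(gs.best_params_, rs.best_params_)
```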

Bayesian techniques are implemented by packages like BayesianOptimization, Spearmint, and pyGPGO.

Hyperopt-sklearn is a software project that provides automatic algorithm configuration of the scikit-learn library. It can be used for both model selection and hyperparameter optimization. Hyperopt-sklearn has the following implemented solvers [12]: RS, simulated annealing, and Tree-of-Parzen-Estimators.

Optunity is a Python library containing various optimizers for hyperparameter tuning. Optunity is currently also supported in R, MATLAB, GNU Octave and Java through Jython. It has the following solvers available: GS, RS, PSO, NMA, Covariance Matrix Adaptation Evolutionary Strategy, Tree-structured Parzen Estimator, and Sobol sequences.

Auto-WEKA [30], built on top of WEKA [25], is able to perform GS, RS, and Bayesian optimization. Auto-sklearn [19] extends the idea of configuring a general machine learning framework with efficient global optimization, introduced with Auto-WEKA. It is built around scikit-learn and automatically searches for the right learning algorithm for a new machine learning dataset and optimizes its hyperparameters.

Following the idea of Auto-WEKA, several automated machine learning tools (AutoML) were recently developed. Their ultimate goal is to automate the end-to-end process of applying machine learning to real-world problems, even for people with no major expertise in this field. Ideally, such a tool will choose the optimal pipeline for a labeled input dataset: data preprocessing, feature selection/extraction, and learning model with its optimal hyperparameters.

Some of the existing AutoML tools are: MLBox, auto-sklearn, Tree-Based Pipeline Optimization Tool, H2O, and AutoKeras. TransmogrifAI is an end-to-end AutoML library for structured data written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation.

Commercial cloud-based AutoML services offer highly integrated hyperparameter optimization capabilities. Some of them are offered by well-known companies:
  • Google Cloud AutoML, a suite of machine learning products that enables developers with limited machine learning expertise to train high-quality models specific to their business needs.

  • Microsoft Azure ML, which can be used to streamline the building, training, and deployment of machine learning models.

  • Amazon SageMaker Automatic Model Tuning, which can launch multiple training jobs, with different hyperparameter combinations, based on the results of completed training jobs. SageMaker uses Bayesian hyperparameter optimization.

5 Case studies

5.1 Hyperparameter optimization in fuzzy ARTMAP models

Hyperparameter optimization is not a new concept. For example, in 1986, Grefenstette optimized the hyperparameters of a genetic algorithm using a meta-level genetic algorithm, an intriguing concept at that time [22].

In [5], we optimized the hyperparameters of a class of Fuzzy ARTMAP neural networks, Fuzzy ARTMAP with Input Relevances (FAMR). The FAMR, introduced in [7], is a Fuzzy ARTMAP incremental learning system used for classification and probability estimation. During the learning phase, each training sample is assigned a relevance factor proportional to the importance of that sample.

Our work in [5] was related to the prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. In fact, we were forced to use a training set of only 176 samples (molecules in this case), since no other data were available. We optimized the FAMR training data relevances using a genetic algorithm. We also optimized the order of the training data presentation. These optimizations ameliorated the problem of insufficient data and we improved the generalization performance of the trained model. The computational overhead induced by these optimizations was acceptable.

In [6], we optimized not only the relevances and the order of the training data presentation but also some hyperparameters of the FAMR network, again using a genetic algorithm. To some extent, the prediction performance improved, but the computational overhead increased significantly and we faced overfitting issues. Based on our experiments, we concluded that using a genetic algorithm for hyperparameter optimization is computationally not feasible for large training datasets.

5.2 Hyperparameter optimization in SVM models

Support vector machine (SVM) classifiers depend on several parameters and are quite sensitive to changes in any of those parameters [15]. For a Gaussian kernel, there are two hyperparameters which define an SVM model: the cost C and \(\gamma \), the parameter of the Gaussian kernel.

In [20], we introduced a dynamic early stopping condition for RS hyperparameter optimization, tested for SVM classification. We significantly reduced the number of trials. The code runs on a multi-core system and has good scalability for an increasing number of cores. We will describe in the following the main results from [20], omitting details which are less relevant here.

A simplified version of a hyperparameter optimization algorithm is characterized by the objective fitness function f and the generator of samples g. The fitness function returns a classification accuracy measure of the target model. The generator g is in charge of providing the next set of values that will be used to compute the model's fitness. A hasNext method implemented by the generator offers the possibility to terminate the algorithm before the maximum number of N evaluations is reached, if some convergence criterion is satisfied.

In the particular case of RS, the generator g draws samples from the specific distribution of each of the hyperparameters to be optimized. Our goal is to reduce the computational complexity of the RS method in terms of m, the number of trials. In other words, we aim to compute less than N trials, without a significant impact on the value of the fitness function.

For this, we introduce a dynamic stopping criterion, included in a randomized optimization algorithm (Algorithm 1). The algorithm is a two-step optimizer. First, it iterates for a predefined number of steps n, \(n \ll N\), and finds the optimal combination of hyperparameter values, \(temp\_opt\). Then, it searches for the first result better than \(temp\_opt\). The optimal result, opt, is either the first result better than \(temp\_opt\) or \(temp\_opt\) if N is reached.

Given a restricted computational budget, expressed by a target number of trials m, we would like to determine the optimal value for n, the value of n which maximizes the probability of obtaining opt after at most m trials, \(n< m < N\) (where \(m=N/2\) is a reasonable target).

We determined this optimal value: \(n = m/e\). Choosing for n a value greater than the optimal one not only increases the probability of finding the optimal hyperparameter instantiation but also increases the probability of using more trials.

Our result can be used to implement an improved version of the previous algorithm that can automatically set the value of n to N/e. For example, to maximize the chances to obtain the best value, after a target maximum of 150 attempts, we must set n to 150/e (\(\approx 55\)). For a target maximum of 100 attempts, n should be 37, and so on.
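A sketch of this two-phase strategy is given below (this is not the exact Algorithm 1 of [20]; the sampler, the toy objective, and the budget are illustrative assumptions):

```python
# Explore the first n = N/e trials, then stop at the first trial that improves
# on the best value seen during exploration (or give up after N trials).
import math
import random

def rs_with_dynamic_stopping(objective, sample, N):
    """objective: lower is better; sample() draws one hyperparameter configuration."""
    n = max(1, int(N / math.e))                      # length of the exploration phase
    best_x = sample()
    best_y = objective(best_x)
    trials = 1
    while trials < n:                                # phase 1: determine temp_opt
        x = sample()
        y = objective(x)
        trials += 1
        if y < best_y:
            best_x, best_y = x, y
    while trials < N:                                # phase 2: stop at first improvement
        x = sample()
        y = objective(x)
        trials += 1
        if y < best_y:
            return x, y, trials                      # early stop
    return best_x, best_y, trials                    # temp_opt, after exhausting N trials

# Toy usage: 1-D quadratic objective, uniform sampling on [0, 1].
x_opt, y_opt, used = rs_with_dynamic_stopping(lambda x: (x - 0.3) ** 2, random.random, N=150)
```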

We can reverse the problem: Given an acceptable probability \(P_0\) to achieve the best result among the N trials, which is the optimal value for n? For the standard RS algorithm without the dynamic stopping criterion, if all trials are independent, the required number of trials needed to identify the optimum with a probability \(P_0\) is given by \(m=N \cdot P_0\).

With our algorithm, we can compromise, accepting a probability P, \(P < P_0\), to identify \(g_{opt}\) using fewer trials.

If all N combinations are tested (i.e., when the early stopping criterion is not activated), P has the lower bound:
$$\begin{aligned} P \ge m/(eN)\cdot \ln {e^2} = 2P_0/e \approx 0.7357P_0. \end{aligned}$$
However, the probability P to find \(g_{opt}\) after fewer than N trials has the lower bound:
$$\begin{aligned} P \ge m/(eN) = P_0/e \approx 0.3678P_0. \end{aligned}$$

We used our method to optimize five SVM hyperparameters: kernel type (RBF, polynomial, or linear, chosen with equal probability); \(\gamma \) (drawn from an exponential distribution with \(\lambda =10\)); cost (C, drawn from an exponential distribution with \(\lambda =10\)); degree (chosen with equal probability from the set \(\{2, 3, 4, 5\}\)); and coef0 (uniform on [0, 1]).
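A sketch of the corresponding generator g is shown below (note that NumPy parametrizes the exponential distribution by the scale \(1/\lambda \); the random seed is arbitrary):

```python
# Sampler for the five SVM hyperparameters, following the distributions above.
import numpy as np

rng = np.random.default_rng(0)

def sample_svm_configuration():
    return {
        "kernel": rng.choice(["rbf", "poly", "linear"]),   # equal probability
        "gamma": rng.exponential(scale=1 / 10),            # exponential, lambda = 10
        "C": rng.exponential(scale=1 / 10),                # exponential, lambda = 10
        "degree": int(rng.choice([2, 3, 4, 5])),           # used only by the polynomial kernel
        "coef0": rng.uniform(0.0, 1.0),
    }
```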

We ran the experiments on six of the most popular datasets from the UCI Machine Learning Repository and obtained on-par accuracy values with the existing mainstream hyperparameter optimization techniques. Our algorithm terminates after a significantly reduced number of trials compared to the standard implementation of RS, which leads to an important decrease in the computational budget required for the optimization.

5.3 Hyperparameter optimization in CNN models

In [21], we introduced an improved version of the RS method, the Weighted Random Search (WRS) method. The focus of the WRS method is the optimization of the classification (prediction) performance within the same computational budget. We applied this method to CNN architecture optimization. We will describe in a simplified way the WRS algorithm from [21].

Similar to GS and RS, we make the assumption that there is no statistical correlation between the variables of the objective function (hyperparameters). The standard RS technique [10] generates a new multi-dimensional sample at each step k, with new random values for each of the sample’s dimensions (features): \(X^k = \{x_i^k\}, i=1, \ldots ,d\), where \(x_i\) is generated according to a probability distribution \(P_i(x), i=1, \ldots ,d\), and d is the number of dimensions.

WRS is an improved version of RS, designed for hyperparameter optimization. It assigns probabilities of change \(p_{i}, i=1, \ldots ,d\) to each dimension. For each dimension i, after a certain number of steps \(k_i\), instead of always generating a new value, we generate it with probability \(p_i\) and use the best value known so far with probability \(1-p_i\).

The intuition behind the proposed algorithm is that after already fixing \(d_0\) (\(1<d_0<d\)) values, each d-dimensional optimization problem reduces itself to a \(d-d_0\) dimensional one. In the context of this \(d-d_0\) dimensional problem, choosing a set of values that already performed well for the remaining dimensions might prove more fruitful than choosing some \(d-d_0\) random values. To avoid getting stuck in a local optimum, instead of setting a hard boundary between choosing the best combination of values found so far or generating new random samples, we assign probabilities of change for each dimension of the search space.

WRS has two phases. In the first phase, it runs RS for a predefined number of trials in order to: (a) identify the best combination of values so far; and (b) gather enough information on the importance of each dimension in the optimization process. The second phase considers the probabilities of change and generates the candidate values according to them. Between these two phases, we run one instance of fANOVA [26] to determine the importance of each dimension with respect to the objective function. Intuitively, the most important dimension (the dimension that yields the largest variation of the objective function) is the one that should change most frequently, to cover as much of the variation range as possible. For a dimension with small variation of the objective function, it might be more efficient to keep a certain temporary optimum value once this has been identified.

A step of the WRS algorithm applied to function maximization is described by Algorithm 2, whereas the entire method is detailed in Algorithm 3. F is the objective function, whose value F(X) has to be computed for each argument; \(X^k\) is the best argument at iteration k, and N is the total number of iterations.

At each step of Algorithm 3, at least one dimension will change, hence we always choose at least one of the \(p_i\) probabilities to be equal to one. For the other probabilities, any value in (0, 1] is valid. If all values are one, then we are in the case of RS.

Besides a way to compute the objective function, Algorithm 2 requires only the combination of values that yields the best F(X) value obtained so far and the probability of change for each dimension. The current optimal value of the objective function can be made optional, since the comparison can be done outside of Algorithm 2. Algorithm 3 coordinates the sequence of the described steps and calls Algorithm 2 in a loop, until the maximum number of trials N is reached.

The value \(p_{i}\) is the probability of change and \(k_{i}\) the minimum number of required values for dimension i, \(i=1,\ldots ,d\). We proved that, regardless of the distribution used for generating \(x_i\), if we choose \(k_i, i=1,\ldots ,d\), so that at least two distinct values are generated for each dimension, then at any step n, WRS has a greater probability than RS of finding the global optimum. Therefore, given the same number of iterations, on average, WRS finds the global optimum faster than RS.

We sorted the function variables with respect to their importance (weights) and assigned their probabilities \(p_i\) accordingly: the smaller the weight of a parameter, the smaller its probability of change. Therefore, the most important parameter is the one that will always change (\(p_1=1\)). To compute the weight of each parameter, we run RS for a predefined number of steps, \(N_0<N\). On the obtained values, we applied fANOVA to estimate the importance of the hyperparameters. If \(w_i\) is the weight of the i-th parameter and \(w_1\) is the weight of the most important one, then \(p_i = w_i/w_1, i=1,\ldots ,d\). We optimized the following CNN hyperparameters: the number of convolution layers, the number of fully connected layers, the number of output filters in each convolution layer, and the number of neurons in each fully connected layer.
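The core of one WRS step can be sketched as follows (the probabilities and value generators are illustrative; in the actual method the \(p_i\) come from the fANOVA weights and the \(k_i\) thresholds described above also apply):

```python
# One WRS candidate: each dimension changes with its own probability p_i,
# otherwise it keeps the best value found so far.
import random

def wrs_step(best_config, generators, p_change):
    """best_config, generators, p_change: dicts keyed by hyperparameter name."""
    candidate = {}
    for name, gen in generators.items():
        if random.random() < p_change[name]:   # change this dimension
            candidate[name] = gen()
        else:                                  # reuse the best value known so far
            candidate[name] = best_config[name]
    return candidate

# Toy usage: the most important dimension always changes (p = 1).
generators = {"n_conv_layers": lambda: random.randint(2, 8),
              "n_filters": lambda: random.choice([16, 32, 64, 128])}
p_change = {"n_conv_layers": 1.0, "n_filters": 0.4}
candidate = wrs_step({"n_conv_layers": 4, "n_filters": 32}, generators, p_change)
```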

We generated each hyperparameter according to the uniform distribution and assessed the performance of the model solely by the classification accuracy. For the same number of trials, the WRS algorithm produced significantly better results than RS on the CIFAR-10 dataset.

6 Conclusions and open problems

Determining the proper architecture design for machine learning models is a challenge because it differs for each dataset and therefore requires adjustments for each one. For most datasets, only a few of the hyperparameters really matter. However, different hyperparameters are important on different data sets. There is no mathematical method for determining the appropriate hyperparameters for a given dataset, so the selection relies on trial and error. We should use a customized combination of optimization, search space and training time reduction techniques.

The computational power of SN P systems was well studied. Several variants of these systems are known to have universal computational capability, being equivalent to Turing machines [44]. Meanwhile, as early as in 1943, McCulloch and Pitts [33] asserted that neural networks are computationally universal. This was also discussed by John von Neumann in 1945 [52]. Details about the universal computational capability of neural models can be found in [4, 46].

There are many possibilities for SN P systems for applications like optimization and classification with learning ability [44]. However, up to this moment, we are aware of no attempt to bring problems and techniques from the neural computing area to the SN P systems area [54]. The current SN P systems with weights, like the McCulloch–Pitts neurons (introduced in 1943), are not able to adapt the weights during a learning process. SN P systems with weights were studied both in the generative and the accepting case, but not in an adaptive case. There are few attempts to use neural network learning rules for adapting the parameters of P systems: [53] uses the Widrow–Hoff rule and [23] uses the Hebbian rule to learn parameters.

To create further analogies between adaptive P systems and machine learning models, we should use more advanced parameter learning algorithms to train P systems. At a meta-level, we should then be able to optimize the hyperparameters of these models.

Acknowledgements

I am deeply grateful to Dr. Gheorghe Păun for his valuable comments on a draft of this paper and for motivating me to finish it.


References

  1. Albelwi, S., & Mahmood, A. (2016). Analysis of instance selection algorithms on large datasets with deep convolutional neural networks. In 2016 IEEE Long Island systems, applications and technology conference (LISAT) (pp. 1–5).
  2. Albelwi, S., & Mahmood, A. (2016). Automated optimal architecture of deep convolutional neural networks for image recognition. In 2016 15th IEEE international conference on machine learning and applications (ICMLA) (pp. 53–60). https://doi.org/10.1109/ICMLA.2016.0018.
  3. Aman, B., & Ciobanu, G. (2019). Adaptive P systems. Lecture Notes in Computer Science, 11399, 57–72.
  4. Andonie, R. (1998). The psychological limits of neural computation. In M. Kárný, K. Warwick, & V. Kůrková (Eds.), Dealing with complexity: A neural networks approach (pp. 252–263). London: Springer.
  5. Andonie, R., Fabry-Asztalos, L., Abdul-Wahid, C. B., Abdul-Wahid, S., Barker, G. I., & Magill, L. C. (2011). Fuzzy ARTMAP prediction of biological activities for potential HIV-1 protease inhibitors using a small molecular data set. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(1), 80–93. https://doi.org/10.1109/TCBB.2009.50.
  6. Andonie, R., Fabry-Asztalos, L., Magill, L., & Abdul-Wahid, S. (2007). A new fuzzy ARTMAP approach for predicting biological activity of potential HIV-1 protease inhibitors. In 2007 IEEE international conference on bioinformatics and biomedicine (BIBM 2007) (pp. 56–61). https://doi.org/10.1109/BIBM.2007.9.
  7. Andonie, R., & Sasu, L. (2006). Fuzzy ARTMAP with input relevances. IEEE Transactions on Neural Networks, 17(4), 929–941. https://doi.org/10.1109/TNN.2006.875988.
  8. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, ICML’09 (pp. 41–48). New York, NY, USA: ACM. https://doi.org/10.1145/1553374.1553380.
  9. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In 25th annual conference on neural information processing systems (NIPS 2011), advances in neural information processing systems (Vol. 24). Granada, Spain: Neural Information Processing Systems Foundation.
  10. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), NIPS (pp. 2546–2554). http://dblp.uni-trier.de/db/conf/nips/nips2011.html.
  11. Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.
  12. Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., & Cox, D. D. (2015). Hyperopt: A Python library for model selection and hyperparameter optimization. Computational Science and Discovery, 8(1), 014008. http://stacks.iop.org/1749-4699/8/i=1/a=014008.
  13. Cabarle, F. G. C., Adorna, H. N., Pérez-Jiménez, M. J., & Song, T. (2015). Spiking neural P systems with structural plasticity. Neural Computing and Applications, 26(8), 1905–1917.
  14. Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27. Software retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  15. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1023/A:1022627411411.
  16. Dagher, I., Georgiopoulos, M., Heileman, G. L., & Bebis, G. (1998). Ordered fuzzy ARTMAP: A fuzzy ARTMAP algorithm with a fixed order of pattern presentation. In 1998 IEEE international joint conference on neural networks proceedings. IEEE world congress on computational intelligence (Cat. No. 98CH36227) (Vol. 3, pp. 1717–1722). https://doi.org/10.1109/IJCNN.1998.687115.
  17. Domhan, T., Springenberg, J. T., & Hutter, F. (2015). Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the 24th international conference on artificial intelligence, IJCAI’15 (pp. 3460–3468). AAAI Press. http://dl.acm.org/citation.cfm?id=2832581.2832731.
  18. Engelbrecht, A. P. (2001). Selective learning for multilayer feedforward neural networks. In Proceedings of the 6th international work-conference on artificial and natural neural networks: Connectionist models of neurons, learning processes and artificial intelligence-Part I, IWANN’01 (pp. 386–393). London, UK: Springer.
  19. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J. T., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. In Proceedings of the 28th international conference on neural information processing systems—Volume 2, NIPS’15 (pp. 2755–2763). Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2969442.2969547.
  20. Florea, A. C., & Andonie, R. (2018). A dynamic early stopping criterion for random search in SVM hyperparameter optimization. In L. Iliadis, I. Maglogiannis, & V. Plagianakos (Eds.), Artificial intelligence applications and innovations (pp. 168–180). Cham: Springer International Publishing.
  21. Florea, A. C., & Andonie, R. (2019). Weighted random search for hyperparameter optimization. International Journal of Computers Communications & Control, 14(2), 154–169. https://doi.org/10.15837/ijccc.2019.2.3514.
  22. Grefenstette, J. J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1), 122–128. https://doi.org/10.1109/TSMC.1986.289288.
  23. Gutiérrez-Naranjo, M. A., & Pérez-Jiménez, M. J. (2009). Hebbian learning from spiking neural P systems view. In D. W. Corne, P. Frisco, G. Păun, G. Rozenberg, & A. Salomaa (Eds.), Membrane computing (pp. 217–230). Berlin: Springer.
  24. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
  25. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18. https://doi.org/10.1145/1656274.1656278.
  26. Hutter, F., Hoos, H., & Leyton-Brown, K. (2014). An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st international conference on machine learning, ICML 2014, Beijing, China, 21–26 June 2014 (pp. 754–762).
  27. Ionescu, M., Păun, G., & Yokomori, T. (2006). Spiking neural P systems. Fundamenta Informaticae, 71, 279–308.
  28. Kabkab, M., Alavi, A., & Chellappa, R. (2016). DCNNs on a diet: Sampling strategies for reducing the training set size. CoRR abs/1606.04232. arXiv:1606.04232.
  29. Kewley, R. H., Embrechts, M. J., & Breneman, C. (2000). Data strip mining for the virtual design of pharmaceuticals with neural networks. IEEE Transactions on Neural Networks, 11(3), 668–679.
  30. Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., & Leyton-Brown, K. (2017). Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 18(25), 1–5. http://jmlr.org/papers/v18/16-261.html.
  31. Lemley, J., Jagodzinski, F., & Andonie, R. (2016). Big holes in big data: A Monte Carlo algorithm for detecting large hyper-rectangles in high dimensional data. In 2016 IEEE 40th annual computer software and applications conference (COMPSAC) (Vol. 1, pp. 563–571). https://doi.org/10.1109/COMPSAC.2016.73.
  32. Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016). Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560. arXiv:1603.06560.
  33. McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
  34. Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7, 308–313.
  35. Olvera-López, J. A., Carrasco-Ochoa, J. A., Martínez-Trinidad, J. F., & Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review, 34(2), 133–143.
  36. Păun, G. (2000). Computing with membranes. Journal of Computer and System Sciences, 61(1), 108–143.
  37. Păun, G., Rozenberg, G., & Salomaa, A. (2010). The Oxford handbook of membrane computing. Oxford: Oxford University Press.
  38. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  39. Plutowski, M., & White, H. (1993). Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4(2), 305–318. https://doi.org/10.1109/72.207618.
  40. Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, KDD’99 (pp. 23–32). New York, NY, USA: ACM.
  41. Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. CoRR abs/1811.12808. arXiv:1811.12808.
  42. Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2018). Regularized evolution for image classifier architecture search. CoRR abs/1802.01548. arXiv:1802.01548.
  43. Refaeilzadeh, P., Tang, L., & Liu, H. (2007). On comparison of feature selection algorithms. In AAAI workshop—technical report (Vol. WS-07-05, pp. 34–39).
  44. Rong, H., Wu, T., Pan, L., & Zhang, G. (2018). Spiking neural P systems: Theoretical results and applications. In C. Graciani, A. Riscos-Núñez, G. Păun, G. Rozenberg, & A. Salomaa (Eds.), Enjoying natural computing: Essays dedicated to Mario de Jesús Pérez-Jiménez on the occasion of his 70th birthday (pp. 256–268). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-00265-7_20.
  45. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175. https://doi.org/10.1109/JPROC.2015.2494218.
  46. Siegelmann, H. T., & Sontag, E. D. (1992). On the computational power of neural nets. In Proceedings of the fifth annual workshop on computational learning theory, COLT’92 (pp. 440–449). New York, NY, USA: ACM. https://doi.org/10.1145/130385.130432.
  47. Skalak, D. B. (1994). Prototype and feature selection by sampling and random mutation hill climbing algorithms. In Machine learning: Proceedings of the eleventh international conference (pp. 293–301). Morgan Kaufmann.
  48. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems 25: 26th annual conference on neural information processing systems 2012. Proceedings of a meeting held December 3–6, 2012, Lake Tahoe, Nevada, United States (pp. 2960–2968).
  49. Song, T., Pan, L., Wu, T., Zheng, P., Wong, M. L. D., & Rodríguez-Patón, A. (2019). Spiking neural P systems with learning functions. IEEE Transactions on Nanobioscience, 18(2), 176–190. https://doi.org/10.1109/TNB.2019.2896981.
  50. Sun, Y., Gong, H., Li, Y., & Zhang, D. (2019). Hyperparameter importance analysis based on N-RReliefF algorithm. International Journal of Computers Communications & Control, 14(4), 557–573.
  51. Sunkad, Z. A., & Soujanya (2016). Feature selection and hyperparameter optimization of SVM for human activity recognition. In 2016 3rd international conference on soft computing machine intelligence (ISCMI) (pp. 104–109). https://doi.org/10.1109/ISCMI.2016.30.
  52. von Neumann, J. (1993). First draft of a report on the EDVAC. IEEE Annals of the History of Computing, 15(4), 27–75. https://doi.org/10.1109/85.238389.
  53. Wang, J., & Peng, H. (2013). Adaptive fuzzy spiking neural P systems for fuzzy inference and learning. International Journal of Computer Mathematics, 90(4), 857–868. https://doi.org/10.1080/00207160.2012.743653.
  54. Wang, J. J., Hoogeboom, H. J., Pan, L., Păun, G., & Pérez-Jiménez, M. J. (2010). Spiking neural P systems with weights. Neural Computation, 22, 2615–2646.
  55. Wang, X., Song, T., Gong, F., & Zheng, P. (2016). On the computational power of spiking neural P systems with self-organization. Scientific Reports, 6, 27624.
  56. Westbrook, J., Berman, H. M., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., et al. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235–242.
  57. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., & Keutzer, K. (2018). FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. CoRR abs/1812.03443. arXiv:1812.03443.
  58. Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. CoRR abs/1611.01578. arXiv:1611.01578.

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

Computer Science Department, Central Washington University, Ellensburg, USA
