1 Introduction

In predictive modeling, the algorithm selection step is often responsible for sub-optimal results and for a large share of the computational cost of the modeling stage.

Many algorithm selection approaches (Kordík et al. 2011; Sutherland et al. 1993; Bensusan and Kalousis 2001; Botia et al. 2001) simply identify the best algorithm from a set of candidates. This set of candidate algorithms needs to be constructed first. The prevailing approach is to pick candidates manually from the set of available algorithms. However, evaluating the performance of an algorithm involves multiple runs (e.g. cross-validation) and is very time consuming. The number of candidate algorithms is high even when only default parameter settings of the individual algorithms are considered.

In machine learning, the parameters of algorithms are quite important; many of them have a direct impact on the plasticity of the generated predictive models. Often, candidate algorithms are evaluated with their default parameter settings. More sophisticated algorithm selection approaches include parameter optimization as part of the selection process. Recent studies have shown the potential of Bayesian methods (Hutter et al. 2011) to outperform both random search (Bergstra et al. 2011) and grid search (Coope and Price 2001).

When ensembles of algorithms (Brown et al. 2006) are taken into account (and they should be, because of their superb performance on many predictive tasks; Hoch 2015; Stroud et al. 2012), the problem of algorithm selection becomes even more difficult. There is potentially an infinite number of possible candidate algorithms, their parametrizations and ensembles to choose from. Furthermore, the generalization performance of algorithms is not the only quality criterion. For large-scale machine learning tasks, the algorithm run-time is of great importance.

It is also necessary to take into account the computational complexity of the algorithm selection process itself. It is not feasible to run all candidate algorithms on a new data set to select the best-performing one, simply because the number of available algorithms is effectively unbounded.

One approach to selecting a training algorithm for a new data set in a reasonable time is to use meta-data collected during training on similar data sets. Meta-learning approaches (Kordík et al. 2011) utilizing meta-data have been studied intensively in the past few decades. They can predict the performance of algorithms on new data sets and consistently select a well-performing algorithm from multiple candidates.

The majority of meta-learning approaches (Kordík et al. 2011; Sutherland et al. 1993; Bensusan and Kalousis 2001; Botia et al. 2001) simply select one of a few predefined, fully specified data mining algorithms. The selected algorithm is expected to produce models with the best generalization performance for the given data. Later methods incorporate a further algorithm selection step on top of the recommendations (Sun and Pfahringer 2013).

More advanced meta-learning approaches combine algorithm selection and hyperparameter optimisation, such as CASH (Thornton et al. 2013), elaborated within the INFER project. In Salvador et al. (2016a, b, c), data mining workflows are optimized including data cleaning and preprocessing steps, together with selected hyperparameters of the modeling methods.

We focus on the modeling stage only and optimize the structure of algorithmic ensembles together with their hyperparameters, as explained in Sect. 5. In this way, we can discover new algorithmic building blocks.

The hierarchical structure of algorithmic ensembles is represented by meta-learning algorithm templates introduced in Sect. 4. Our templates indicate which learning algorithms are used and how their outputs are fed into other learning algorithms.

We use genetic programming (Koza 2000) to evolve the structure of templates and parameter optimization to adjust the parameters of algorithms for specific data sets. Our approach allocates exponentially more time to the evolution of above-average templates, which is similar to the Hyperband approach (Li et al. 2016). In the later stages of our algorithm, templates that survived from previous generations have more and more time to show their potential.

We show that evolved templates can be successfully used to generate models on similar data sets.

Furthermore, templates evolved (discovered) on small data sets can be reused as building blocks for large data sets.

Building predictive models on large data samples is a challenging task. The learning time of algorithms often grows fast with the number of training samples and the dimensionality of a data set. Hyper-parameter optimization can help us generate more precise models for a given task, but it adds significant computational complexity to the training process. We show that templates evolved on small data subsamples can outperform state-of-the-art algorithms, including complex ensembles, in terms of performance and scalability.

The next section discusses related work and shows how recent results in the field of meta-learning and automated machine learning are relevant for our research. Before we define the concept of meta-learning templates in Sect. 4, we describe the building blocks of our templates (base algorithms and ensemble methods, Sect. 3). The introduction to templates is followed by an explanation of the evolutionary algorithm designed to evolve templates (Sect. 5). Experiments described in the later sections aim to show that hierarchical templates can outperform standard ensembles (on standard benchmarking data sets, Sect. 6), that they can be used for transfer learning, and that they can be scaled up for large-scale modeling (Sect. 9).

2 Related work

This contribution is tightly related to meta-learning. The definition of meta-learning is very broad. One of the early machine learning related definitions (Vilalta and Drissi 2002) states that a meta-learning system must include a learning subsystem which adapts with experience.

Another definition (Brazdil et al. 2009) requires meta-learning to start at a higher level and be concerned with accumulating experience over several applications.

Finally, according to Vanschoren (2010), meta-learning monitors the automatic learning process itself and tries to adapt its behaviour to perform better.

2.1 Knowledge base meta-learning approaches and workflows

One of the main directions in meta-learning is constructing a meta-level system utilizing a knowledge repository (Vanschoren et al. 2012; Brazdil et al. 2009). The repository is intended to store meta-data describing the problem being solved and the performance of base learners.

Then, for any new problem, one looks at problems with similar meta-data to select the best-performing algorithms (Kordík et al. 2011). ESPRIT Statlog (Sutherland et al. 1993) compared the performance of numerous classification algorithms on several real-world data sets. In this project, metadata (statistical features describing the data sets) were used for algorithm recommendation. The MetaL project (Bensusan and Kalousis 2001), built upon Statlog's outcomes, utilized landmarking (Pfahringer et al. 2000) metadata (results of fast algorithms executed on a data set in order to determine its complexity). A ranking of algorithms can be obtained by fast pairwise comparisons (Leite and Brazdil 2010) on just the most useful cross-validation tests (Leite et al. 2012). Another project was METALA (Botia et al. 2001), an agent-based distributed data mining system supported by meta-learning. Again, the goal was to select, from among the available data mining algorithms, the one producing models with the best generalization performance for the given data.

The problem with recommending a particular algorithm is that the portfolio of algorithms is potentially infinite. The “Frankenstein” ensembles winning Kaggle competitions (Puurula et al. 2014) are a good example of how complex the topology of machine learning ensembles can be. In Bonissone (2012), so-called lazy meta-learning is applied to create customized ensembles on demand. Individual models, ensembles and combinations of ensembles in time series forecasting can be selected adaptively (Lemke and Gabrys 2010) by meta-learning.

Recommendation and optimization of data mining workflows (Grabczewski and Jankowski 2007; Jankowski 2013; Sun et al. 2013) is another important research direction aiming at automation in data science. In this article, we optimize hierarchical modeling templates that are more general than simple ensembles but still narrower than universal data mining templates including data preparation. Planning and optimization of full data mining workflows is also elaborated in Nguyen et al. (2014) and Kietz et al. (2012), where a meta-model and an AI planner are combined. In contrast, we focus on the predictive modeling stage only and extend the search to the domain of hierarchical ensembles of predictive algorithms.

2.2 Ensembling as meta-learning

Even simple model ensembling methods such as Boosting (Schapire 1990), Stacking (Wolpert 1992) or Cascade Generalization (Gama and Brazdil 2000) can be considered meta-learning methods with respect to the above definitions of meta-learning. They all use information from previous learning steps to improve the learning process itself. There are many more ensembling approaches, and these can be further combined in a hierarchical manner resembling structures in the human brain, as we show in this article.

Theoretical derivation and experimental confirmation that hierarchical ensembles are the best performing option for some classification problems can be found in Ruta and Gabrys (2002, 2005).

Aggregation or hierarchical combination of ensembles has been studied (Analoui et al. 2007; Costa et al. 2008; Sung et al. 2009) intensively not only in predictive modeling. In particular, gradient boosting (Friedman 2000) and multi-level stacking of neural networks (Bao et al. 2009) were parts of the winning solution in the Netflix competition (Töscher and Jahrer 2009; Bennett et al. 2007).

These hierarchical ensembles are single purpose architectures often tailored to one particular problem (data set), where they exhibit excellent performance, but very likely fail with different data. The prevailing approach to constructing these ensembles is manual trial-and-error combined with extensive hyper-parameter optimization.

2.3 Growing ensembles and their optimization

One of the first growing ensembles introduced was the GMDH MIA approach (Mueller et al. 1998), which can also be considered adaptive layered stacking of models. Our GAME neural networks (Kordík 2009) grow inductively from data to match the complexity of the given task and maximize generalization performance.

Another growing ensemble of neurons (or network), called NEAT (Stanley and Miikkulainen 2001), was primarily designed for reinforcement learning controllers. Evolutionary approaches are used to optimize the topology and parameters of these ensembles.

When it comes to optimization of ensembles, genetic programming was also used to evolve trees of ensemble models, as suggested in Hengpraprohm and Chongstitvatana (2008), but only to a limited degree, with a single type of ensemble, and the article deals with the Cancer data only.

An interesting approach to ensemble building (Caruana et al. 2004) is to prepare ensembles from libraries of models generated using different learning algorithms and parameter settings.

The neural network ensembling method GEMS, proposed in Ulf Johansson (2006), trains models independently and then combines them using genetic programming into a trivial hierarchical ensemble based on a weighted average. The weights are evolved by means of genetic programming rather than derived from model performance as in Boosting, for instance.

Multi-component, hierarchical predictive systems can be constructed by grammar-driven genetic programming, as in Tsakonas and Gabrys (2012), an approach very similar to ours. They used very limited ensembling templates and trivial base models, and focused on maintaining diversity during evolution. We focus more on time efficiency.

Some of the modern scalable neural networks (Buk et al. 2009; Smithson et al. 2016; Fernando et al. 2016) can be constructed using indirect encoding. Structures can be evolved at the macro level (Real et al. 2017), optimizing large building blocks, or at the micro level (Zoph and Le 2016), optimizing the internal structure of neuron cells. Neuroevolution of deep and recurrent networks (Miikkulainen et al. 2017; Rawal and Miikkulainen 2016) is computationally expensive, but the results look very promising.

In predictive modeling and supervised learning, it is often more efficient to optimize continuous parameters of algorithms independently of the topology (in contrast to the TWEANN approach, Stanley and Miikkulainen 2001). The most popular approaches to continuous hyperparameter optimization are simple brute-force grid search and random search (Bergstra and Bengio 2012). More sophisticated approaches are based on Bayesian methods (Salvador et al. 2016b). The recently introduced bandit-based method Hyperband (Li et al. 2016) uses the performance of base learners to speed up the learning process and can therefore be considered a meta-learning approach. When the learning of models can be terminated prematurely, a significant amount of resources can be saved, speeding up learning by giving more resources to promising learners. A disadvantage of this approach is that complex models need more time to adapt, and it is hard to estimate their final performance in the early stages of learning.

The CASH approach in Auto-WEKA (Thornton et al. 2013) combines algorithm selection and hyperparameter optimization (Hutter et al. 2011) in the classification domain. In our approach we optimize the topology of ensembles as well.

2.4 Scalable meta-learning

The proper topology for a given problem is particularly important when machine learning models are evaluated by multiple criteria. The most important criterion is the generalization performance, but often the time of model training/recall should also be taken into consideration (Chan and Stolfo 1997; Sonnenburg et al. 2008).

Our results on the Airline data set suggest that a simple ensemble of sigmoid models can significantly outperform deep learning models (Arora et al. 2015) when it comes to scalability and learning efficiency.

A recent paper on large-scale evolution of image classifiers (Real et al. 2017) is another example of a time-sensitive approach, where one can trade off generalization performance for learning/recall speed.

Anytime learning (Grefenstette and Ramsey 2014) aims at building algorithms capable of returning the best possible solution within a given training time. Anytime ensembling methods such as SpeedBoost (Grubb 2014) can not only generate approximate models rapidly from weak learners, but are also capable of using extra time, when available, to further improve their performance. We designed the evolutionary search in Sect. 5 in this spirit.

Before discussing anytime optimization of our ensembles, we describe the building blocks and ensembling mechanisms used.

3 Base algorithms and ensembling strategies

We build ensembles from fast weak learners (Duffy and Helmbold 1999). Many of our base models resemble neurons with different activation functions. We can use base models to construct both classification and regression ensembles. In this article, we focus on classification tasks only; however, regression models can also be present in classification ensembles.

The classification task itself can be decomposed into regression subproblems by separating each class from the others. These binary class separation problems can be approximated by regression models estimating continuous class probabilities. The class with the maximum probability is then taken as the output value. The classifier consisting of regression models is further referred to as ClassifierModel.
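
As an illustration of this decomposition, the following is a minimal Python sketch (class and parameter names are our own; a scikit-learn linear regressor stands in for the paper's fast regression base models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in for any fast regression base model


class ClassifierModel:
    """Sketch: decompose a K-class problem into K one-vs-rest regression
    subproblems and predict the class with the highest estimated probability.
    Illustrative only, not the FAKE GAME implementation."""

    def __init__(self, make_regressor=LinearRegression):
        self.make_regressor = make_regressor
        self.models, self.classes_ = [], None

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.models = []
        for c in self.classes_:
            # one regression model per class, trained on 0/1 membership targets
            self.models.append(self.make_regressor().fit(X, (y == c).astype(float)))
        return self

    def predict(self, X):
        # stack per-class probability estimates and take the arg-max class
        scores = np.column_stack([m.predict(np.asarray(X, dtype=float)) for m in self.models])
        return self.classes_[np.argmax(scores, axis=1)]
```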

3.1 Base algorithms

Training regression models (as components of probabilistic classifiers) is fast and straightforward. We use several activation functions in simple perceptrons, namely Sigmoid, SigmoidNorm, Sine, Polynomial, Gaussian, Exponential and Linear.

To train the coefficients of linear or polynomial models, the general least squares method (Marquardt 1963) is applied. For models that are non-linear in their coefficients, an iterative optimization process is needed. We compute analytic error gradients for all fast regression models and employ a quasi-Newton method (Shanno 1970) to optimize their parameters.

The LocalPolynomial base model as well as the Neural Network (NN), Support Vector Machine (SVM), Naive Bayes classifier (NB), Decision Tree (DT) and K-Nearest Neighbor (KNN) were adopted from the RapidMiner environment (RapidMiner).

3.2 Ensembling algorithms

The performance of models can often be further increased by combining or ensembling (Brazdil et al. 2009; Kuncheva 2004; Wolpert 1992; Schapire 1990; Woods et al. 1997; Holeňa et al. 2009) base algorithms, particularly in cases where base algorithms produce models of insufficient plasticity or models overfitted to training data (Brown et al. 2006).

A detailed description of the large variety of ensemble algorithms can be found in Brazdil et al. (2009). We briefly describe the ensembling algorithms that are used in our experiments. Bagging (Breiman 1996) is the simplest one; it selects instances for base models randomly with repetition and combines the models with a simple average. Boosting (Schapire 1990) specializes models on instances incorrectly handled by previous models and combines them with a weighted average. Stacking (Wolpert 1992) uses a meta-model, learned from the outputs of all base models, to combine them. Another ensemble utilizing meta-models is Cascade Generalization (Gama and Brazdil 2000), where every model except the first one uses a data set extended by the outputs of all preceding models. Delegating (Ferri et al. 2004) and Cascading (Alpaydin and Kaynak 1998; Kaynak and Alpaydin 2000) both use a similar principle: they operate with the certainty of model outputs. A subsequent model is specialized not only on instances that are classified incorrectly by previous models, but also on instances that are classified correctly yet with low output certainty. Cascading only modifies the probability of selecting given instances for the learning set of the next model. Arbitrating (Ortega et al. 2001) uses a meta-model called a referee for each model; the purpose of this meta-model is to predict the probability of a correct output. All methods used in this study were implemented within the FAKE GAME open source project (Fake Game).
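
To make one of these schemes concrete, here is a small sketch of Cascade Generalization using scikit-learn conventions (the class name, member choice and use of predict_proba are our illustrative assumptions, not the FAKE GAME implementation):

```python
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


class CascadeGeneralization:
    """Sketch of Cascade Generalization (Gama and Brazdil 2000): every member
    except the first is trained on the input attributes extended by the
    class-probability outputs of all preceding members."""

    def __init__(self, members):
        self.members = members  # untrained classifiers supporting predict_proba

    def fit(self, X, y):
        self.trained_ = []
        Xe = np.asarray(X, dtype=float)
        for proto in self.members:
            model = clone(proto).fit(Xe, y)
            self.trained_.append(model)
            # append this member's class probabilities as extra input columns
            Xe = np.hstack([Xe, model.predict_proba(Xe)])
        return self

    def predict(self, X):
        Xe = np.asarray(X, dtype=float)
        for model in self.trained_[:-1]:
            Xe = np.hstack([Xe, model.predict_proba(Xe)])
        return self.trained_[-1].predict(Xe)


# usage: a cascade of a 2-nearest-neighbour classifier followed by a decision tree
# cascade = CascadeGeneralization([KNeighborsClassifier(2), DecisionTreeClassifier()]).fit(X, y)
```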

4 Meta-learning templates

The meta-learning template (Kordík et al. 2011) is a prescription for how to build hierarchical supervised models. In the most complex case, it can be a collection of ensembling algorithms and base algorithms combined in a hierarchical manner, where base algorithms are leaf nodes connected by ensembling nodes. Regression models or classifiers deeper in the hierarchy can be more specialized to a particular subset of data samples or attributes. This scheme decomposes the prediction problem into subproblems and combines the final solution (model) from subsolutions. The procedure of problem decomposition depends on the ensembling methods. Typically, it distributes data to member models and, when all outputs are available, combines them into the ensemble output.
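
One possible way to represent such a template as a data structure is sketched below (the names, parameter keys and the choice of the three sibling base models in the Fig. 1 example are our own illustrative assumptions):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TemplateNode:
    """Sketch of a template node: inner nodes name an ensembling algorithm,
    leaves name a base algorithm, and each node carries its own parameters."""
    algorithm: str                                        # e.g. "Bagging", "Boosting", "KNN"
    params: Dict[str, object] = field(default_factory=dict)
    children: List["TemplateNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


# the template of Fig. 1 written as a tree (sibling base models are assumed)
fig1_template = TemplateNode("Bagging", {"members": 4}, [
    TemplateNode("DT"),
    TemplateNode("Boosting", {}, [
        TemplateNode("KNN", {"k": 3}),
        TemplateNode("Stacking", {}, [
            TemplateNode("NN"),
            TemplateNode("DT"),
            TemplateNode("SVM"),   # meta-model combining NN and DT outputs
        ]),
    ]),
    TemplateNode("NN"),
    TemplateNode("NB"),
])
```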

Fig. 1 An example of hierarchical combination of algorithms. Using this meta-learning template, a classifier can be produced (see Fig. 2). The template can be represented by (a) a tree, (b) embedded boxes or (c) text

Note that meta-learning templates are not data mining models, but algorithms. Models are produced when templates are executed.

Figure 1 shows an example of a meta-learning template. When it is executed, the full training data set is passed to the top-level Bagging, which generates four bootstrap training data sets for the members of the ensemble. The second bootstrap training data set is used by Boosting to train a KNN classifier; samples on which this classifier shows a high error are more likely to be used in the training set of the second member model of the Boosting: the Stacking of NN and DT classifiers. The bottom-level NN and DT are evaluated on the training data, and an SVM meta-model is trained on their responses. The Stacking is evaluated and a weight is assigned to its output in the Boosting. The output of the Boosting is averaged with the other three top-level base models, and the whole classifier is finished (see the left-hand tree in Fig. 2).

Fig. 2 An ensemble classifier can be produced by the hierarchical combination of algorithms depicted in Fig. 1. Executing the template will distribute data to leaf base models according to procedures specified by ensembling algorithms. Base models and ensembles are constructed until the root ensemble (base model) is finished. Using the model involves presenting an input vector, propagating it to leaf models and combining their outputs by ensembling procedures

The resulting classifier is depicted in Fig. 2. The tree in the center shows how the input attributes are presented to the model. The propagation of the input vector is straightforward in this example, but some ensembles (e.g. Cascading) involve evaluation of member models (their outputs are added to the input vectors of subsequent models). The right-hand tree shows how the outputs of base models are blended to produce the final output.

Whereas data mining workflows are directed acyclic graphs, meta-learning templates are hierarchical structures. Inner nodes in our templates are ensembling algorithms and leaf nodes are base algorithms. Fully predefined templates are algorithm configurations containing parameters of both ensembles and base algorithms. Templates can be generalized using wildcards (see Fig. 3) to represent a subspace of the search space of topologies and parametrizations of hierarchical ensembles.

Fig. 3 Nested ensembles can be represented by a template. Using wildcards, a specific (i.e. predefined) template can be generalized to represent a set of templates

Similarly to Holland's schema theorem (Holland 1975), we can define the fitness of a template as the average/maximum fitness of the individual algorithms represented by this particular template. Wildcards here are used just as placeholders for random decisions on the type of ensembles or base algorithms and their parameters. In contrast, in the rooted-tree schema theory (Rosca 1997), wildcards represent sub-trees.
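
As a minimal formalization in our own notation (not taken verbatim from the paper), let \(I(T)\) denote the set of fully specified algorithm configurations obtainable by resolving all wildcards of a generalized template \(T\). The two variants of template fitness mentioned above can then be written as

\[ f_{\mathrm{avg}}(T) = \frac{1}{|I(T)|} \sum_{a \in I(T)} f(a), \qquad f_{\max}(T) = \max_{a \in I(T)} f(a), \]

where \(f(a)\) is the estimated generalization performance of a concrete algorithm \(a\) represented by the template.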

5 Discovering templates

A meta-learning template can be designed manually using expert knowledge (for example, bagging of boosted decision trees has shown good results on several problems), so it is likely to perform well on a new data set. This is, however, not guaranteed.

In our approach, we optimize templates on data sub-samples using genetic programming (Koza 2000). In this way, we can search the space of possible architectures of hierarchical ensembles and optimize their parameters simultaneously.

5.1 Evolving templates by genetic programming

Applying genetic programming or grammatical evolution involves resolving (a) the representation of individuals, (b) the design of genetic operators and of the evolution, (c) the formulation of the fitness function and (d) the construction of the initial population.

Algorithm 1 Pseudo-code for evolving meta-learning templates

5.1.1 Encoding templates to chromosomes

Encoding is straightforward, because in genetic programming (GP) individuals are represented as trees. Each specific template has ensembles in its inner nodes and base algorithms in its leaf nodes, while their parameters are associated with the corresponding nodes (not encoded as individual nodes as in Koza's representation). Generalized templates contain wildcards in their chromosomes. Wildcards are represented as lists of genes, and one of these genes is randomly selected whenever an individual should be produced from the template. For example, when 20 base models should be generated, the heuristic selects randomly from the list of available algorithms twenty times.
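
A rough sketch of this wildcard mechanism (the names and candidate list are ours, not the actual chromosome encoding):

```python
import random

# a wildcard is simply a list of candidate algorithm genes
BASE_ALGORITHM_WILDCARD = ["KNN", "DT", "NN", "NB", "SVM", "SigmoidModel", "PolynomialModel"]


def resolve_wildcard(candidates, n_nodes, rng=random):
    """Pick a concrete algorithm for each of n_nodes placeholder positions."""
    return [rng.choice(candidates) for _ in range(n_nodes)]


# e.g. materialize 20 base models for an ensemble node defined with a wildcard
members = resolve_wildcard(BASE_ALGORITHM_WILDCARD, n_nodes=20)
```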

5.1.2 Adaptive control for anytime learning

The pseudo-code of Algorithm 1 shows how to evolve meta-learning templates. There are two parameters: a time limit for the algorithm and an attribute that decides whether a metadatabase should be used to streamline the evolution. Later, we discuss the advantages and disadvantages of using the metadatabase.

The algorithm has several internal parameters, and many of them are adaptive. The time limit influences most of the internal parameters, because only fast templates on small data samples can be evaluated under small time allocations. With more time available, the search for the best-performing meta-learning templates can intensify and explore a bigger part of the search space.

The algorithm receives a data set as input. When the data set has more than 200 dimensions or 500 instances (constants experimentally chosen based on results on several data sets), a sample is generated using random subsampling, or stratified sampling in the case of small or imbalanced data. We sample both instances and attributes when the constraints are violated, in order to get a representative data subset.
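
A simplified sketch of this sampling rule, assuming the thresholds quoted above and a straightforward per-class stratification (the actual heuristics may differ):

```python
import numpy as np


def representative_subsample(X, y, max_rows=500, max_cols=200, rng=None):
    """If the data exceeds 500 instances or 200 dimensions, sample rows
    (stratified by class to keep imbalanced data representative) and
    attributes at random. Illustrative sketch only."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X, y = np.asarray(X), np.asarray(y)

    if X.shape[0] > max_rows:
        keep = []
        for c in np.unique(y):                       # stratified row sampling
            idx = np.flatnonzero(y == c)
            share = max(1, int(round(max_rows * len(idx) / len(y))))
            keep.append(rng.choice(idx, size=min(share, len(idx)), replace=False))
        rows = np.concatenate(keep)
        X, y = X[rows], y[rows]

    if X.shape[1] > max_cols:                        # random attribute sampling
        cols = rng.choice(X.shape[1], size=max_cols, replace=False)
        X = X[:, cols]
    return X, y
```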

5.1.3 Initial population and subsequent evolutions

The initial population of the first evolution (generalized templates) is generated from a minimal form, similarly to Stanley and Miikkulainen (2001) and Mueller et al. (1998). In case the metadatabase is not used, base models form the population. The advantage is that each type of base model is considered before ensembles are taken into account, and the population grows from a minimal form. With the metadatabase, the initial population is filled with the best individuals from the most similar meta-data records (using pairwise similarities of attribute statistics). For subsequent evolutions, we use the population from the last epoch of the previous evolution.

While time is available, we run a sequence of evolutions that gradually explore the state space of possible templates. The first evolution runs on a small data sample (\(200\times 500\) maximum), and after a maximum of one hundred generations (or when stagnation is detected), the data is doubled (both dimensionality and numerosity, if possible) and the next evolution follows. In each subsequent evolution, templates are more specific and the percentage of wildcards decreases. Also, the ranges of explored parameters increase as templates get more precise and specific.

This is quite similar to the Hyperband approach (Li et al. 2016), where exponentially more time is given to promising learners. Here, many template topologies are eliminated on a small data subset. Only the most successful templates are examined on larger sets, and their parameters are extensively fine-tuned.

The optimization process is designed to be time-constrained. For each algorithm, we estimated its scalability so that we can predict the run-time given the size of a data set and the parameter settings. Parameters like the maximum template depth, the maximum allowed computational complexity of a template, the intervals of base algorithm parameters and the size of data samples are then set adaptively based on the time available. Similarly to Li et al. (2016), we allocate exponentially more resources to the search in promising parts of the state space. For particular details see [Software: Fake game, data mining software (https://fakegame.sourceforge.net/)], (Kordík et al. 2014) or our open source implementation (Kordík 2006).
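
The data-growth schedule behind these successive evolutions can be sketched as follows (a simplification assuming strict doubling capped at the full data set; the adaptive control, stagnation detection and run-time prediction of Algorithm 1 are omitted):

```python
def sample_schedule(n_rows, n_cols, start_rows=500, start_cols=200):
    """Return the sequence of (rows, cols) sample sizes used by successive
    evolutions: each step doubles numerosity and dimensionality, capped at
    the full data set. Illustrative sketch only."""
    rows, cols, schedule = min(start_rows, n_rows), min(start_cols, n_cols), []
    while True:
        schedule.append((rows, cols))
        if rows == n_rows and cols == n_cols:
            return schedule
        rows, cols = min(2 * rows, n_rows), min(2 * cols, n_cols)


# e.g. sample_schedule(100_000, 50) ->
# [(500, 50), (1000, 50), (2000, 50), ..., (100000, 50)]
```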

5.1.4 Fitness evaluation

Fitness evaluation is also time-effective. The fitness is estimated by repeated cross-validation (CV) (Browne 2000; Kordík et al. 2014): it is proportional to the average performance of models generated on training folds and evaluated on testing folds, while the data is divided into folds multiple times. We need a reliable estimate of the generalization performance of models/templates even when the time allocated for the optimization is very short (e.g. 1 min). For short time allocations, data samples are small (up to 300 instances) and repeated CV runs are necessary to reduce the variance of the cross-validation estimates. With more time available, our fitness estimates are refined by additional fitness evaluations, starting with the most promising candidates whose recent evaluations show a high variance. An approach similar to Moore and Lee (1994) or Li et al. (2016) helps us allocate additional resources to promising candidates.
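
A minimal sketch of such a repeated cross-validation fitness estimate (scikit-learn utilities used for illustration; the adaptive refinement of promising candidates described above is omitted):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def template_fitness(build_classifier, X, y, repeats=5, folds=10, seed=0):
    """Average test-fold accuracy over several re-shuffled cross-validation
    runs; the spread is returned to identify estimates worth refining."""
    X, y = np.asarray(X), np.asarray(y)
    scores = []
    for r in range(repeats):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in cv.split(X, y):
            model = build_classifier().fit(X[train_idx], y[train_idx])
            scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```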

After fitness evaluation, selection is implemented by a tournament. We do not use crossover, just mutations, similar to the approach used in standard GP (Koza 2000). Mutations grow/modify both the topology and the parameters of templates.

Structural mutations are realized using the context-free grammar rules (Whigham 1995) shown in Table 1, which define how templates can grow from simple base classifiers to large hierarchical ensembles, often containing regression ensemble sub-trees.

Table 1 Context-free grammar rules

Parameters of a node are mutated by applying Gaussian noise to the current value. The mutation probabilities and distribution of noise are controlled by adaptive parameters for anytime learning.
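
A minimal sketch of such a parameter mutation (scaling the noise to a fraction of the allowed interval is our assumption, not a documented setting):

```python
import random


def mutate_parameter(value, low, high, sigma_fraction=0.1, rng=random):
    """Perturb the current value with Gaussian noise and clip it back into
    the valid interval [low, high]. Illustrative sketch only."""
    sigma = sigma_fraction * (high - low)
    return min(high, max(low, value + rng.gauss(0.0, sigma)))


# e.g. mutate the k of a KNN node constrained to [1, 50]
# new_k = round(mutate_parameter(current_k, 1, 50))
```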

The exploration versus exploitation capabilities of the evolutions are also influenced by adaptive mutation probabilities and intervals. For exact parameter settings and adaptation strategies, please consult (Fake Game; Kordík et al. 2014).

5.1.5 Metadatabase

When the metadatabase is enabled, the population of general templates can be seeded from it. The probability that a template is selected for seeding the population is inversely proportional to the squared distance between the meta-data vectors and proportional to the robust performance of the template. The robust performance is defined as the average rank of the template's performance on similar data sets. Then, as the algorithm runs, templates consisting of a single base algorithm are evaluated on the data set and stored in the metadatabase. Their performance is used as a landmarking attribute (Pfahringer et al. 2000) and, together with data statistics, makes up the meta-features. The meta-feature vector is then compared to the other vectors stored in the metadatabase, and the most similar records are returned. The records contain a list of the best templates, which are inserted into the initial population. The fitness of each template is updated during the evolutions, and when the optimization terminates, the winning templates are saved as a new record in the metadatabase, or the corresponding records are updated with the new templates.
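
The seeding rule can be sketched as follows (our own formulation of the weights; the epsilon term and the normalization are implementation details we assume):

```python
import numpy as np


def seeding_probabilities(new_meta, stored_meta, robust_performance, eps=1e-9):
    """Weight each stored template by its robust performance divided by the
    squared distance between meta-feature vectors, then normalize to
    probabilities. Illustrative sketch only."""
    new_meta = np.asarray(new_meta, dtype=float)
    stored_meta = np.asarray(stored_meta, dtype=float)       # one row per stored record
    d2 = np.sum((stored_meta - new_meta) ** 2, axis=1) + eps  # eps avoids division by zero
    weights = np.asarray(robust_performance, dtype=float) / d2
    return weights / weights.sum()


# draw templates for the initial population according to these probabilities
# idx = np.random.choice(len(stored_templates), size=pop_size, p=probs)
```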

Section 7.5 provides experimental results showing that using templates from a metadatabase is beneficial for most of the data sets. On the other hand, a metadatabase can lead to template overfitting, and one should avoid seeding the initial population with templates evolved on data that has already been used, as we show later in the experimental part.

5.2 Exploring models produced by templates

The final template is comprehensively tested, and the generalization performance of the models generated by this template should be the highest among the candidate templates. The quality of the selected template can also be observed in the shape and consistency of the decision boundaries of models produced from it.

As an example, we ran the evolution on the Two Intertwined Spirals data set (Juille and Pollack 1996) (10 min on a standard PC). The template that was finally selected can be written as: ClassifierCascadeGenProb{4\(\times \) KNN(k=2,vote=true, measure=ManhattanDistance)}. We used our RapidMiner plugin [Software: Fake game, data mining software (https://fakegame.sourceforge.net/)] to visualize the structure and behavior of the classifier produced when this template was executed. The template contains the ClassifierCascadeGenProb ensemble of three 2NN classifiers. In the Cascade Generalization (Gama and Brazdil 2000) ensemble, every model except the first one uses a data set extended by the outputs of all previous models. In this particular case, the first 2NN classifier is produced on the Spiral data set, and the input of the second 2NN classifier is enriched by the two outputs of the first classifier (probabilities of membership in one of the two intertwined spirals). The third classifier receives the two original ’spiral’ inputs plus four output probabilities from the already generated classifiers, and so on.

This behavior can be observed in Fig. 4a. As can be seen in the thumbnail images, where the background color should match the color of the data points for a perfect classifier, the first KNN algorithm is capable of making a nearly perfect model, except for small regions with absent learning data. The other classifiers specialize in these regions, so the final cascade ensemble classifies the Spiral data even better. Figure 4b shows the decision boundaries of a recently evolved template that outperformed the Cascade Generalization of KNN classifiers: ClassifierModel{outputs \(\times \) LocalPolynomialModel}. The LocalPolynomialModel was added to our base algorithms recently, and it apparently performs better than KNN on this problem. The evolved algorithm works as follows. It builds a lazy model based on LocalPolynomial regression to model the probability of each class (spiral) given the input coordinates. The final output is decided by the ClassifierModel, which chooses the higher of the probabilities returned by the two LocalPolynomial models.

Fig. 4 The template performing cascade generalization ensemble of three 2NN classifiers was discovered by the evolution on the Spiral problem. The thumbnail images show the response of classifiers to the change of their two most relevant inputs

Templates evolved on the Spiral data should also produce good models for similar problems, e.g. for any other complex separation problem in two dimensions. The experiments described in the following sections aim to reveal the universality of the discovered templates.

6 Small data sets for evaluation

To evaluate transfer learning capabilities (Pan and Yang 2010) of meta-learning templates, it is necessary to experiment with a wide range of data sets. First of all, we use small data sets of different complexity.

Table 2 lists the data sets used as well as their size, dimensionality and number of classes (outputs). Most of the data sets are taken from the UCI repository (Frank and Asuncion 2010). Other data sets (mostly artificial) are tailored to evaluate the data separation capabilities of algorithms on low-dimensional problems. The Spirals data set was used in the previous section and was designed as a benchmark for global approximation methods.

Spread is a two-dimensional artificial data set, which was created with an evolutionary algorithm to be unsolvable by the basic classification algorithms available in RapidMiner. The fitness function was inversely proportional to the performance of the best classifier, and the chromosomes contained the parameters of a data set generator.

The Texture1 and Texture2 data sets come from a generator of images for pattern recognition (Texture 2008). Four features were extracted from these images, one using the local binary pattern and the other three with a \(5\times 5\) convolution matrix for each color component (RGB). We generated balanced data sets with 250 instances for each class (segment). Texture1 was formed by three segments (750 instances) and Texture2 by ten segments (2500 instances).

Table 2 Data sets are obtained mostly from the UCI repository and are small to medium-sized

Splitting data into learning and testing sets to avoid overfitting is a well-known principle. In the process of template evolution, it is necessary to estimate the quality of templates and to balance well the data used for learning and for evaluation. Also, when some testing data is used to select the best-performing template, we should not use it for testing any more, because the error estimate might be biased.

7 Examining properties of templates

First of all, we designed an experiment to verify whether hierarchical ensembles can outperform simple ensembles and base models. For this experiment, we selected the Glass data set, because it has quite complex decision boundaries and hierarchical ensembles can therefore reveal their potential.

7.1 Hierarchical ensemble

We compare three configurations of the optimization process with the same time allocation of 2 h. In the first configuration, we restrict the search to trivial templates with base algorithms only (depth 0). The second configuration allows the evolution to also consider ensembles of base algorithms (depth 1). The third configuration extends the search to second-level hierarchies, i.e., ensembles of ensembles (depth 2). There was no additional restriction on the type of base models or ensembles. We used 20-fold cross-validation to increase the size of the training sets.

Table 3 A trivial template (maximal template depth limited to 0) was evolved on the Glass data set for a given time (this is equivalent to selection among base algorithms with optimized parameters); then we ran the same experiment with maximal depth 1 (simple ensembles of base algorithms allowed) for the same amount of time, and so on

Table 3 shows that the generalization performance of templates increases with depth. It is interesting that all levels were dominated by the same base model (Decision Tree). We can conclude that for this particular problem (the Glass data) the hierarchical ensemble represents a significant improvement over base models or regular ensembles.

Of course, ensembles and hierarchical ensembles are not always beneficial (for example, when problems are linearly separable, there is no need for ensembles, because most base algorithms are capable of solving the task perfectly alone). That is also the motivation for our optimization procedure to explore trivial templates first and then gradually extend the search space to more complex ensembles and hierarchies.

7.2 Template overfitting

The next experiment examines the sensitivity of meta-learning templates to data overfitting. At the same time, we explore the robustness of our approach in terms of generating stable solutions for very similar problems.

Fig. 5 The workflow evaluating sensitivity to overfitting and stability of the solution. Note that the validation error is not computed just from the validation set but also from the learning set, because multiple CV is performed

The experimental setup is rather complicated, so we use Fig. 5 to illustrate it. The Ecoli data set was divided into two folds of equal size (training and testing). The training fold was subsequently divided into learning and validation folds multiple times, with the division ratio iterating from 0.1 (10% learning, 90% validation) to 1 (100% learning, 0% validation). Learning sets of increasing size were used to evolve meta-learning templates and to produce models by executing the templates on the same data. These models were evaluated on the training set and on the testing set, producing the validation and test errors, respectively. Whereas the test errors are unbiased estimates of the model performances, the validation errors gradually turn into the training errors (possibly biased) as the size of the learning set increases. Note that for a 100% learning data fraction, the full training set is used to evolve templates and build models, so the validation error becomes the training error.

We averaged the results from 20 repetitions of this setup and plotted the development of the errors (see Fig. 6). The level of data overfitting is reasonably low. Even when the same (training) data set is used to evolve the template, build the model and estimate its error, the error is not significantly different from the unbiased estimate computed on the independent testing set. This is mainly due to the fitness function used in the evolution of templates, which favors templates generating models that perform well on unseen data. Glyphs summarize the numbers of base algorithms and meta-algorithms appearing in the evolved templates for each division ratio. For tiny learning data sets (ratio below 0.3), diverse templates were evolved in each repetition, whereas for ratios above 0.8, the evolved templates were almost identical.

Fig. 6 The difference between test and validation errors is not significant. Glyphs indicate percentages of base algorithms and meta-algorithms in winning templates

Note that this behavior is demonstrated on the Ecoli data set, but it can also be observed on other data sets. Below, we experiment with a diverse portfolio of small data sets to examine which templates are discovered and whether they can be reused on other data sets.

7.3 Templates evolved for various data sets

Templates evolved on the data sets (see Table 5) were serialized into a text description representing their internal structure. As can be seen in Table 4, for some data sets trivial templates were evolved (for example, the KNN algorithm for Heart and Pendigits), for other data sets a regular ensemble performed best (for example, the Boosting of Decision Trees for Segment), and hierarchical templates were the best solution for the Vehicle, Texture2, Wine and Breast data sets, among others. Note that the depicted templates are representatives of the final templates selected in multiple runs on the benchmarking data sets. In each independent (no metadatabase) run of the evolution on a single data set, the final template can differ. The diversity of the final templates may be minimal for some data sets and significant for others; however, they are very similar in terms of functionality and complexity.

The occurrence of individual algorithms can be counted in the evolved templates. Almost 40% of the solutions were hierarchical templates, and the same percentage contained the ClassifierModel decomposing the classification problem into N regression problems of class probability estimation. It is surprising that regression models are present so often in the final classification templates. One possible explanation is that our optimization algorithms for predictive modeling are very efficient and fast; therefore, the evolution can explore many more variants in the given time than in the case of KNN, neural networks or other classification algorithms that tend to be slower. It is apparent that ensembles, and particularly their hierarchical variants, significantly outperform optimized base algorithms on several data sets.

Table 4 Templates evolved on individual data sets serialized into text description

7.4 Similarity and substitutability of templates

The aim of this experiment is to evaluate performance of evolved templates on other data sets to see how universal each template is.

In our contribution (Kordík et al. 2012), we analyzed the similarity of templates in terms of performance on individual data sets. We executed each template on all data sets and measured the performance of the generated classifiers. We split each data set randomly into two folds, one used for learning and the second for evaluating the classifier; then the folds are exchanged. Due to noise in the results, we repeated this procedure 25 times, so that each template was evaluated using 25\(\times \) two-fold cross-validation on all data sets.

Fig. 7 Performances of meta-learning templates on individual data sets visualized as a starplot matrix. Labels of templates in the legend are derived from data sets used to evolve them. For Balance data, the template evolved on Segment data (Tseg) performs worst whereas the template evolved on Balance data (Tbal) performs best—expected behavior

The results summarized in Fig. 7 show that three data sets (Breast, Wine and Texture1) are very easy to classify, no matter which template (algorithm) is used. The set of evolved templates was slightly different from that listed in this paper (Table 4). Although we have added some base algorithms and improved the global heuristics of the evolution since the last experiments published in Kordík et al. (2012), the winning templates are quite consistent.

There is a group of four data sets (Ecoli, Heart, Ionosphere, Segment) that can be solved by most of the templates except those based on Polynomial models. These models are trained by the least squares algorithm (Kordík et al. 2010). For certain data (noisy, with binary inputs and overlapping instances) the algorithm fails to deliver a solution due to a non-invertible matrix; the parameters of the polynomials are then set randomly and the result is poor. There are also two complex data sets (Spirals and Spread) that can be solved almost exclusively by their own templates, and one complex noisy data set (Texture2) where only an ensemble (or hierarchical ensemble) of algorithms can deliver satisfactory results. Note that for a number of data sets a template derived from another data set performs better than the one derived for that specific data set. The differences are not significant; they are caused by noise in the process of selecting a template and evaluating it on the other data set.

Based on these experiments, we can conclude that hierarchical templates evolved on particular complex problems (data sets) often have the capacity to solve other complex problems very well. This is often the case for complex general-purpose templates containing universal algorithms such as neural networks. On the other hand, some problems (Spirals, Spread) require specific algorithms (KNN, local polynomial regression). Note that in our previous work (Kordík et al. 2012), the template CascadeGenProb{9\(\times \) ClassifierModel{outputs \(\times \) ExpModel}} was evolved for the Spread problem, and it failed to produce good classifiers on the Spiral data set.

When the performance of templates on individual data sets is averaged, we get the “universality” of templates. Templates based on polynomial models are the least universal (with 60% average performance). On the other hand, the most universal is the Texture2 template (double stacking of neural networks). With an average performance of over 80% on all data sets, the top three templates (Ttexture2, Tspread, Tspirals) contain hierarchical ensembles.

7.5 Evaluation of metadatabase

The motivation for this experiment is to evaluate when the metadatabase should and should not be used. The content and usage of the metadatabase are described in Sect. 5.1.5. We ran experiments on selected data sets in two configurations: (a) metadatabase disabled (initial population generated from the minimal form, i.e. base classifiers) and (b) metadatabase enabled.

Fig. 8 An improvement in the convergence of the evolution can be observed for all tested data sets when seeding from the metadatabase is used. There is a danger of overfitting because very similar data sets were already used to evolve the seeded templates. The bias, however, should not be high, because of the low number of parameters in a template

The positive influence of seeding the initial population with templates from a metadatabase is demonstrated in Fig. 8. The best solution (template) found in the initial population is far better for all tested data sets when the metadatabase is used. The improvement is bigger for complex tasks, such as the Spiral problem, where hierarchical templates have to be discovered and the evolution of such templates from a randomly initialized population takes many generations.

We have to note that the metadata and templates of the tested data sets were excluded from the metadatabase, with one exception. For Spirals, a very similar data set (Spirals + 3 irrelevant attributes) was contained in the metadatabase, which is why the performance went up rapidly after seeding templates from this data set. This observation motivated us to investigate how prone templates are to data overfitting. We stored templates evolved on a single data set and then reused them again with the same data. We performed 300 evolutions initialized from the metadatabase (after each evolution, the metadatabase was updated when a better solution was found).

Fig. 9 Templates can overfit the data when the metadatabase is used and the same data set is presented to the system over and over again. The data set here is Borelia (reproduced with permission from Motl (2013))

Apparently, there is a danger of data overfitting when the metadatabase is used. Once the best template is stored, the data set used to evolve or select the template should not be further used for validation or testing. When the same data set is presented over and over again, and each time the best template from the metadatabase is inserted into the initial population of templates, overfitting occurs after a dozen runs, as shown in Fig. 9, even though a template has just a fraction of the parameters of a model generated from it. We recommend using the metadatabase only in scenarios where we can guarantee enough data to prevent reusing data samples.

8 Benchmarking meta-learning templates on small data

In this section, we evaluate the performance of meta-learning templates evolved using the proposed algorithm and compare it to standard algorithms and ensembles on the small data sets described in Sect. 6. We decided to perform the experiments in our environment, because we can easily control the run-time of algorithms and compare them to RapidMiner classifiers.

The methodology was the following. The training set was used to evolve templates on each data set by means of the procedures described above. Then the final template selected for each data set was evaluated on the corresponding test data set by fifty repetitions of ten-fold cross-validation. The performance of meta-learning templates was compared to the most popular algorithms contained in the RapidMiner software. For each data set, we measured the performances of all algorithms with default settings on the training data and evaluated the best-performing algorithm on the testing set (also 50\(\times \) ten-fold CV).

To improve the results of the RapidMiner algorithms with default settings, their parameters were optimized on the training set with the same evolutionary framework described above (ensembling was disabled). Again, the most successful algorithm with optimized parameters was selected on the training data and evaluated on the testing set.

Table 5 The column Templates lists averaged test performances (classification accuracies in percent) of individual templates, evolved on the corresponding training data sets
Table 6 Leave-one-out validation performance of selected base models and ensembles on benchmarking data sets

The results of the experiments (Table 5) confirm that classifiers generated by meta-learning templates should be of the same or better quality than base classifiers trained by standard single algorithms from RapidMiner. This is due to the fact that the evolution of templates starts from the minimal form and all base algorithms (including RapidMiner base models) are examined at the beginning and survive in the population unless significantly better solutions (e.g. hierarchical ensembles) drive them out.

Although the evolution of the best template itself took about a hundred times longer to complete (the approximate time was in minutes, compared to the learning process of base models, which is measured in milliseconds), additional computing time given to the base algorithms would not increase their capacity to generate more precise models. The computing time of optimized base models in RapidMiner is comparable to the time spent on template evolution, especially when grid search is used.

Additionally, we have computed the performance of base models in default settings and of selected ensembles using leave-one-out cross-validation. The classification accuracies in Table 6 show that most of the results were dominated by meta-learning templates, despite the fact that in the leave-one-out cross-validation setup, models can benefit from bigger training sets than in the case of the 10-fold cross-validation used to obtain the results in Table 5. On average (we use the geometric average, which is more robust to outliers), templates outperformed the selected simple ensembles in spite of the smaller training sets (10-fold CV versus LOOCV).

In our experiments, we were not able to compare templates to all possible ensembles of base algorithms. The number of such combinations is so high that heuristic search is needed, and this feature is not available in RapidMiner. Also, we have many base models that are not available in other environments. Meta-learning templates perform so well partly due to their ability to build classifiers from sub-trees of regression models.

9 Templates at scale

The recent rise of big data modeling challenges the scalability of predictive modeling algorithms and tools. One obvious approach is to reduce the dimensionality and numerosity of the data (Borovicka et al. 2012). This approach works in most cases, because big data often includes similar, redundant cases. However, for some data sets, the performance of predictors increases significantly with a growing number of training instances. For such data, scalable algorithms (Basilico et al. 2011) and tools (Arora et al. 2015; Meng et al. 2016) have been developed.

Most of these approaches are based on a map-reduce technique (Chu et al. 2007).

In this section, we show that meta-learning can also be used at scale. Our approach is inspired by van Rijn et al. (2015), where classifiers selected on sub-samples work reasonably well on larger data sets. We evolve templates on a subset of 3000 randomly selected instances. Then, the evolved template can be executed on the full data. When we do not have enough time for meta-learning template evolution, it is also possible to generate the subset just for computing meta-features. Then we can use the best-performing template for the data set with the most similar meta-features.

For the template execution, we split the large data into multiple disjoint subsets and then use the map-reduce paradigm to train multiple instances of the template. Prediction is made by reducing (majority voting over) the models generated from the template.

This approach is very similar to bagging, except that we do not use bootstrap sampling.
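
A rough sketch of this train/predict scheme (the template API, the partition count and the use of Python multiprocessing are illustrative assumptions, not our actual implementation):

```python
import numpy as np
from multiprocessing import Pool


def train_on_partition(args):
    template, X_part, y_part = args
    return template.build_classifier(X_part, y_part)    # hypothetical template API


def fit_template_at_scale(template, X, y, n_partitions=8):
    """Split the data into disjoint partitions (no bootstrap), train one model
    per partition in parallel (map) and keep the models for voting (reduce)."""
    parts = np.array_split(np.arange(len(y)), n_partitions)
    with Pool(n_partitions) as pool:
        models = pool.map(train_on_partition,
                          [(template, X[idx], y[idx]) for idx in parts])
    return models


def predict_majority(models, X):
    """Majority vote over per-partition models; assumes integer class labels."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```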

9.1 Experiments

We have conducted experiments to get insight into the scalability of several machine learning algorithms from H2O as well as our parallel training of templates. Our motivation is to show that proper algorithm selection is especially important for large data sets and can often be done using a fraction of the data set.

We have chosen two public data sets—HIGGS (Baldi et al. 2014) and Airline Delays, which is available through H2O (H2O 2015). These data sets are used for binomial classification of selected output attributes.

We benchmark our parallelized templates against models available in H2O.ai, implemented using the map-reduce approach. The Generalized Linear Model (Hussami et al. 2015) uses logistic regression to deal with classification problems. The Naive Bayes classifier assumes independence of the input attributes and classifies based on conditional probabilities obtained from the training data. Deep Learning (Arora et al. 2015) is a feedforward neural network with various activation functions in neurons. Distributed Random Forest and Gradient Boosted Machine (Click et al. 2016) are ensembles based on decision trees. H2O Ensemble is an ensemble classifier called Super Learner (LeDell 2016).

The following experiments use 1,000,000 randomly selected rows from each data set. Then 50% of the rows are randomly selected as a test set, and the rest is sampled into subsets of growing size to examine the scalability of the algorithms. Each sampled subset is randomly split into a training set (80%) and a validation set (20%).
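
A sketch of this splitting protocol (function and parameter names are ours):

```python
import numpy as np


def make_scalability_splits(X, y, subset_sizes, seed=0):
    """Hold out 50% of the rows as a fixed test set, then for each growing
    subset size draw rows from the remainder and split them 80/20 into
    training and validation indices. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    test_idx, pool = idx[: len(y) // 2], idx[len(y) // 2:]
    splits = []
    for n in subset_sizes:                 # e.g. [10_000, 100_000, 500_000]
        subset = rng.choice(pool, size=min(n, len(pool)), replace=False)
        cut = int(0.8 * len(subset))
        splits.append((subset[:cut], subset[cut:], test_idx))   # train, valid, test
    return splits
```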

Fig. 10 Comparison of several machine learning algorithms in H2O.ai trained on samples of various sizes from the Higgs (Baldi et al. 2014) data set

First, we examined the scalability of the algorithms on the Higgs data set. Figure 10 shows the learning time and performance of individual algorithms executed on subsets of growing size. The best performance was achieved by Deep Learning, which was also reasonably fast. Gradient Boosting is faster, but it does not have the capacity to improve with bigger data subsets. Distributed Random Forest is also reasonably accurate and fast, but it is dominated by Deep Learning on Higgs. Ensembles produced from templates are not very competitive on this data set. Only a complex hierarchical ensemble of decision trees approaches the performance of Distributed Random Forest, but it is much slower. Our implementation is not optimized for H2O.ai.

Fig. 11 Predicting IsArrDelayed on the Airline data set: comparison of algorithms in H2O.ai trained on subsamples of increasing size

Looking at Fig. 11, where the arrival delay is predicted on the Airlines data set, the results are completely different. Our ensembles are both more accurate and faster. The difference is so big that we decided to analyze these results further.

We even simplified the prediction task by predicting the departure delay without removing the DepTime attribute.

Fig. 12 Predicting IsDepDelayed on the Airline data set: comparison of algorithms in H2O.ai trained on subsamples of increasing size. The left subfigure shows that the performance of our templates for small data samples is significantly higher than that of the other algorithms. An interesting observation is that Deep Learning needs almost 100k training instances to match the performance of a simple template trained on a 10k data set. Also, when it comes to training times, the differences among algorithms are huge (right subfigure). It takes almost 20 min to train H2O Ensemble on this task, whereas the Sigmoid template is trained in a few seconds

The prediction problem then becomes quite trivial, because the target (is the departure delayed?) can be obtained by comparing the DepTime and CRSDepTime attributes. It is quite surprising that most of the classifiers are misled by other attributes and fail to discover this simple relationship.

Figure 12 shows that, again, our simple ensembles based on the sigmoid model are able to learn fast and solve the problem even on small subsets. H2O Ensemble and Deep Learning discovered the relationship only with 500 thousand instances, and their learning time was significantly higher.

Fig. 13 Hyperparameter search using either Random Search or SMAC is not beneficial for the H2O models. In most cases, the performance is worse than for the algorithms in default settings

To ensure that the problem is not caused by improper parameter settings, we ran parameter optimization on a subset of 100 thousand instances. The list of parameters and their ranges is available [Software: Algorithmic templates for h2o.ai (https://github.com/kordikp)]. Figure 13 shows that most of the H2O algorithms are very sensitive to improper parameter settings. Deep Learning was able to converge with the default parameter settings only; our assumption is that its parameters are controlled adaptively by default. Similarly, a negative impact was observed for the Generalized Linear Model. For Gradient Boosting and Distributed Random Forest, the optimization discovered a better-performing configuration; however, the difference was not significant. We also optimized the number of models in our hierarchical ensembles, but apparently it had almost no effect on performance. The Decision Tree based ensemble was unable to solve the task in any configuration, which is consistent with the poor performance of DT-based ensembles from H2O. On the other hand, the Sigmoid-based ensemble was able to discover the relationship even with a minimal number of models in the ensemble, which is consistent with the previous experiments. From the boxplots and the distribution of individual results (red dots), the Bayesian Optimization (SMAC) method outperformed Random Search.

Fig. 14 Decision boundaries of algorithms on the problem of predicting aircraft departure delay. A simple ensemble of sigmoid classifiers was able to generalize the relationship well, whereas decision tree based ensembles overfitted the data. Deep Learning discovered the relationship only on large data samples. Note that the plots show the behaviour of the classifiers just in a two-dimensional plane of the multidimensional input space. The attributes DepTime and CRSDepTime were, however, the most important dimensions

Plots of class probabilities and decision boundaries helped us reveal the reason for the poor performance of decision tree based ensembles. Figure 14 shows that the successful classifiers (the ensemble of sigmoid models and Deep Learning) were able to identify a simple relationship between two input attributes and the departure delay. The relationship (decision boundary) is hard for decision trees to model with their orthogonal decisions. It is also impossible to solve for the Naive Bayes classifier, which assumes independence of the input attributes.

Apparently, we were able to discover a very efficient template for this trivial problem. We believe that our approach can help to evolve (discover) templates for diverse data sets and predictive tasks. Building a library of algorithmic templates can improve the capacity of predictive modeling systems to solve diverse tasks efficiently.

10 Conclusions

In this article, we propose to optimize the topology and parameters of algorithmic ensembles by an evolutionary algorithm. We show that useful algorithms can be discovered on small data sub-samples and later applied to large data sets.

We use meta-learning templates to describe sets of algorithmic ensembles and examine their performance on several benchmarking problems.

Classifiers generated using these templates are capable of outperforming, in terms of generalization accuracy, classifiers produced by the most popular data mining algorithms.

We found that templates are prone to data overfitting in spite of the very low number of their parameters. One needs to be aware of this issue, especially when a metadatabase is employed. Data samples that were used to select the best template in previous runs cannot be reused for an unbiased estimate of the generalization performance. The metadatabase can, however, be very useful in speeding up convergence when enough data is available for independent model validation.

We show how templates can be scaled up for modeling large data sets using the map-reduce approach. Benchmarks revealed that our approach is able to produce algorithms competitive with state-of-the-art approaches for large-scale predictive modeling. Ensembles of simple regression models can outperform popular algorithms in both generalization ability and scalability, as demonstrated on the Airlines data set.