1 Introduction

Algorithm-selection—the process of selecting the best algorithm to solve a given problem instance—is motivated by the potential to exploit the complementary performance of different algorithms on sets of diverse problem instances. However, determining the best-performing algorithm for an unseen instance has been shown to be a complex problem that has attracted much interest from researchers over the decades (Kotthoff 2016; Kerschke et al. 2018a; Smith-Miles 2009). A common approach to tackling the Algorithm-Selection Problem (ASP) is to treat it as a classification problem where each instance is described in terms of a vector of hand-designed features, and an instance’s class indicates the best-performing algorithm. Although there have been a number of successful studies using this method, e.g. Perez et al. (2004), Kandanaarachchi et al. (2018), Collautti et al. (2013), Kerschke and Trautmann (2019), the task of identifying appropriate features that correlate with algorithm performance is far from trivial in many domains: in some domains, specifying features is not intuitive, and it can be difficult to create a sufficient number to train a model, while in others, where there are many features, it is necessary to invoke feature-selection methods in order to choose appropriate features (Kerschke et al. 2018b; Smith-Miles et al. 2014), as noisy and uninformative features prevent the selection techniques from making intelligent decisions (Loreggia et al. 2016).

Feature-design is even more complex in domains in which the data has sequential characteristics. For example, in online bin-packing (Lee and Lee 1985; Ramanan et al. 1989) and online job-shop scheduling problems (Weckman et al. 2008; Liu et al. 2009), items/tasks arrive in a stream (one at a time) and have to be packed/assigned to a container/machine exactly in the sequence that they arrive. In such cases, informative features would need to capture the sequential information contained in an instance, but deriving such features is even more challenging than in the cases mentioned above.

One solution to dealing with sequential data can be found in the field of deep learning, where the use of recurrent neural networks with Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) or Gated Recurrent Units (GRU) (Cho et al. 2014) to classify sequential data has become widespread in recent years; example applications include text classification (Lee and Dernoncourt 2016), scene-labelling (Byeon et al. 2015) and time-series classification (Karim et al. 2017). Such networks directly use a sequence of data as input (e.g. the size of the next item to be packed in bin-packing). In this sense they are ‘feature-free’ in that it is not necessary to derive auxiliary features from the data to train the network. The addition of the LSTM/GRU to the network enables a model to learn the long-term context or dependencies between symbols present in an input sequence, and also allows it to handle variable-length sequences of information. Therefore, we propose that an RNN-LSTM or RNN-GRU could be used as a feature-free classification technique to perform algorithm-selection in optimisation domains in which there is sequential data (Footnote 1). To be clear, in the context of this paper, the term feature-free refers to the use of the raw input data defining a problem instance as input to a selector, where no pre-processing of the data is required, nor is there a need to define and derive features a priori from the problem data. This therefore addresses the associated issues outlined above. Also, we restrict the algorithm space to a set of deterministic heuristics; however, we use the phrase “algorithm-selection” as it is the phrase by which this problem is known in the literature.

In a recent conference paper (Alissa et al. 2019), we described an initial implementation of an RNN-LSTM to perform algorithm-selection in the field of 1D bin-packing, showing that it was able to outperform the Single-Best Solver (SBS) (i.e. the single heuristic that achieves the best performance over the instance set) on multiple datasets and to achieve performance comparable to the Virtual-Best Solver (VBS), i.e. the oracle. Our approach has subsequently been adopted by Seiler et al. (2020) and modified to work in the TSP domain. Here we extend our previous work by proposing an additional neural architecture for prediction that uses gated recurrent units. In addition, we compare the two RNN architectures to six different classical machine learning techniques that use derived features as input, to provide better insight into the relative merits of recurrent networks versus classical, feature-based techniques. An extensive evaluation is conducted using five benchmark datasets. In order to understand the conditions under which the proposed methods are likely to be useful, we then conduct a systematic investigation over fourteen newly created datasets that exhibit increasing levels of structure within the data, using a proxy for structure defined in terms of the performance difference between heuristics on the same instance. This sheds new light on why some spaces are more likely to facilitate classification than others.

The contributions are as follows:

  • A novel feature-free algorithm-selection approach using a recurrent neural network with either long-short-term memory or gated recurrent units that avoids the need to identify features through training, using only the sequential information defining a problem instance.

  • An extensive comparison of the feature-free approaches to feature-based approaches, using a wide range of features as input to six well-known classifiers, each tuned to ensure optimised performance.

  • A systematic investigation of the relationship between classification performance and the level of structure in the dataset, using fourteen newly developed datasets that exhibit controllable levels of structure.

Results show that on the five benchmark datasets, the feature-free approach significantly outperforms the best feature-based approaches, with classification accuracies across the five datasets that are very close to the theoretical optimum. We further find that there is a threshold for ‘structure’, below which algorithm-selection might be unnecessary. Given that many real-world problems are known to be structured (Smith-Miles 2009; Hains et al. 2011), this underlines the need to develop better algorithm-selection methods to be used in practice.

The rest of the paper is organised as follows. Section 2 reviews the different approaches to the ASP in the literature. A brief summary of the Online 1D Bin-Packing Problem (1D-BPP) and the heuristics used in this study is presented in Sect. 3. Section 4 describes the problem instances we use in this research. We describe the deep learning and machine learning models (including the features used) in Sect. 5. The research methodology is explained in Sect. 6. The results are presented in Sect. 7, followed by a systematic analysis across multiple datasets in Sect. 8.

2 Related work

Fig. 1 Schematic of the algorithm-selection problem (Rice 1976)

Originally formulated by Rice (1976), the per-instance Algorithm-Selection Problem (ASP) can be defined as:

“Given a set \(I\) of instances of a problem \(P\), a set \(A = \{a_1, \ldots ,a_n\}\) of algorithms for \(P\) and a metric \(m : A \times I \rightarrow {\mathbb {R}}\) that measures the performance of any algorithm \(a_j \in A\) on instance set \(I\), construct a selector \(S\) that maps any problem instance \(i \in I\) to an algorithm \(S(i) \in A\) such that the overall performance of \(S\) on \(I\) is optimal according to metric \(m\).” (Kerschke et al. 2018a)

A schematic of the ASP is shown in Fig. 1 (Rice 1976). In Rice’s definition:

  • The problem space \({\mathcal {P}}\) represents a potentially infinite sized set of instances for the problem domain

  • The feature space \({\mathcal {F}}\) describes a set of characteristics derived using feature extraction from \({\mathcal {P}}\)

  • The algorithm space \({\mathcal {A}}\) is the set of algorithms available for the problem domain

  • The performance space \({\mathcal {Y}}\) maps each algorithm to a set of performance metrics (Smith-Miles et al. 2014).

The objective is to identify a mapping between \({\mathcal {P}}\) and \({\mathcal {A}}\) that maximises \({\mathcal {Y}}\). In this paper, we restrict the algorithm space to a set of deterministic 1D-BPP heuristics.

For a finite set of problem instances I, a fixed set of heuristics H and a single performance metric m, the Virtual Best Solver (VBS) is defined as a perfect mapping between I and H. The Single Best Solver (SBS) is the heuristic \(\in H\) that achieves the best performance over I. Although Rice’s framework is a useful approach for describing the ASP, it provides no advice about the mapping from problem space \({\mathcal {P}}\) to the feature space \({\mathcal {F}}\), and it clearly shows that the effectiveness of the algorithm-selection process for a particular problem domain relies on the quality of the problem’s features (Smith-Miles and van Hemert 2011; Smith-Miles and Lopes 2012). A comprehensive review of different approaches towards algorithm-selection can be found in a number of survey papers in this active area of research (Kerschke et al. 2018a; Kotthoff 2016; Smith-Miles 2009). Some of the most relevant approaches are described here.

2.1 ASP with feature-based approaches

One of the most common approaches to the ASP is to identify features using expert knowledge, and then train machine learning methods to predict the best performing algorithm(s) for an instance from its feature-profile (Kerschke et al. 2018a). However, identifying features that are significant in determining performance is complex, usually requires hand-crafting (Nudelman et al. 2004; Hutter et al. 2014; Smith-Miles and van Hemert 2011; Pihera and Musliu 2014), and is often not intuitive. The approach must frequently also be combined with a feature-reduction method, e.g. Principal Component Analysis (PCA) (Smith-Miles et al. 2014; López-Camacho et al. 2013), to simplify learning and to understand the correlations between features and algorithm performance.

Cruz-Reyes et al. (2012) used meta-learning and hyper-heuristics to solve the ASP in the domain of 1D-BPP. Their methodology relies on data collected from past experience to characterise algorithm performance and is divided into three phases: initial training, prediction, and training with feedback. The output of the training phase is a trained model that relates the problem characteristics to the algorithms’ performance—this model is used to predict the best algorithm for a new given instance in the prediction phase. The newly solved instances are then incorporated into the knowledge base to improve the selection quality. They used five deterministic heuristics and two non-deterministic algorithms for 1D-BPP. Three machine learning methods were compared: Discriminant Analysis (DA) (Pérez et al. 2004) and a decision tree to build the selectors, and a Self-Organizing Map (SOM) (Haykin et al. 2009) to implement the selection system with feedback. Five features were used as input. Their method obtained 76% accuracy with DA and 81% accuracy with the decision tree when selecting the best algorithm. Furthermore, when using the SOM with feedback and the minimum number of problem characteristics, the accuracy increased from 78.8% (with initial training) up to 100%.

López-Camacho et al. (2013) also studied the ASP in the packing domain using a wide range of 23 features and six heuristics within an evolutionary hyper-heuristic framework. They studied the correlation between the structure of 1D- and 2D-BPP instances and the performance of the solvers using PCA (Ringnér 2008) as a knowledge discovery method. Most of the features used are related to 2D-BPP; after feature-reduction, a subset of nine features, including the mean and standard deviation (std) of the item sizes, was found to be strongly correlated with heuristic performance. They analysed the distribution of feature values across the PCA map, and their analysis suggested that there are indeed correlations between instance characteristics and heuristic performance. Brownlee et al. (2018) used ten BPP features related to the distribution of item sizes within each instance, together with performance features, to analyse the relationship between the training data and the automatic design of algorithms. They investigated the distributions of feature values over the instances in benchmark sets, and how these distributions relate to the performance of algorithms built by automatic design of algorithms. They concluded that high variation in some of these features, including the mean, standard deviation and maximum of item size, is a strong indicator of good fitting to the training instances and of good performance for automatic design of algorithms.

A different type of approach that could be used for the ASP with constructive approaches to solution generation was proposed by Ross et al. (2002); rather than deriving features from the original description of an instance, a small number of features were derived from the current instance state each time a heuristic was applied. A learning classifier system, XCS (Wilson 1995), was used to map a set of problem-states to specific heuristics. An approach that tries to avoid having to hand-craft good features was described by Sim et al. (2012), who evolve the parameters of a feature-design method for 1D bin-packing problems so as to best improve the performance of a k-Nearest Neighbours (KNN) classifier (Shalev-Shwartz and Ben-David 2014).

Another ASP approach that does not explicitly rely on feature identification and extraction was proposed by Sim et al. (2015). Here, a system continuously generates novel heuristics which are maintained in an ensemble, and samples multiple problem instances from the environment. Heuristics that “win” an instance (perform best) are retained. This was shown to rapidly produce solutions and generalise over the problem space, but required a greedy method of actually selecting between generated heuristics and hence does not fit the classical ASP definition.

2.2 ASP with streaming problems

Although feature-based approaches have been shown to work well in domains in which there is no sequential information associated with an instance description (Footnote 2), domains in which data arrives in a continual stream are more challenging. Statistical approaches to defining features for streaming data are complex, and developing algorithm selectors to tackle streaming data poses considerable challenges due to potentially large streams, the fact that the order of data points cannot be influenced, and the fact that the underlying distribution of the data points in the stream can change over time. A recent survey article describing the state-of-the-art in algorithm-selection (Kerschke and Trautmann 2019) highlighted a pressing need to develop automated algorithm-selection methods that are capable of learning in the context of streaming data. A supervised-learning approach was used by van Rijn et al. (2018, 2014) to predict which classifier performs best on a (sub)stream. Unsupervised learning approaches such as stream-clustering have been used to identify, track and update clusters over time (Carnein et al. 2017; Gong et al. 2017). However, due to the huge space of parameter and algorithm combinations, clear guidelines on how to set and adjust them over time are lacking (Mansalis et al. 2018; Carnein and Trautmann 2019; Amini et al. 2014).

2.3 ASP with deep learning approaches

Recently, deep learning algorithms have gained some traction in the ASP field due to their ability to learn from extremely large datasets in reasonable time. Mao et al. (2017) proposed a heuristic performance predictor using a deep neural network trained on a large set of instances of the variable-sized 1D bin-packing problem using 16 features as input, grouped into item, box and cross features. Their prediction system achieved up to 72% validation accuracy in selecting the best performing heuristic, i.e. the one that generates the better quality bin-packing solution. To eliminate the arduous task of manually designing features, Loreggia et al. (2016) proposed a deep learning approach to automatically derive features in the SAT and CSP domains, assuming that any problem instance can be expressed as a text document. Unlike previous works, e.g. Smith-Miles et al. (2014), López-Camacho et al. (2013), that automatically reduce derived features using PCA, their approach automatically derives features from a visual representation of the problem instances (i.e. converting the text files into grey-scale square images), which can be used to train a convolutional neural network to predict the best solver for the instance. Although their approach obtained better results than the SBS, it was not able to outperform the approaches that use regular manually crafted features. Although concerned with learning an optimisation method rather than algorithm-selection, Hu et al. (2017) used deep reinforcement learning (a Pointer Network) with 3D-BPP to optimise the sequence of items to be packed into the bin by choosing the sequence, orientation and empty maximal space to pack cuboid-shaped items. They claimed that their proposed method obtained about a 5% improvement over a well-designed heuristic.

As mentioned in the introduction, in a recent conference paper, Seiler et al. (2020) adopted and adapted the LSTM approach we proposed in Alissa et al. (2019) to be applicable to the Euclidean TSP domain, also using an evolved (and balanced) dataset (1000 instances) with two TSP solvers. They compared a feature-based approach using four different classical ML classifiers to a feature-free approach using deep learning Convolutional Neural Networks (CNNs). Due to the large TSP-related feature sets, they conducted a data analysis and automatic feature selection to choose an adequate set of the 15 most relevant features. Their results show that the feature-based approach improved over the SBS performance but remained quite far from the performance of the oracle-like VBS. The feature-free approach matches the performance of the quite complex classical ML approaches, despite being based solely on a raw visual representation of the TSP instances. Although the TSP is not an online or sequential problem, the work of Seiler et al. (2020) borrows the key concept of our proposed method, i.e. that the raw data defining an instance can be used without modification as input to a selection algorithm.

The approach proposed in this paper differs substantially from the previous work just described in that it abandons the need to derive features from a dataset, circumventing the associated issues. In contrast to some previous research which extends Rice’s diagram to encapsulate a broader agenda relating to the relative power of algorithms (e.g. Smith-Miles et al. 2014), our proposed method shrinks Rice’s diagram through bypassing the feature extraction block. Furthermore, as far as we are aware, it provides the first example of applying a recurrent-neural network as an algorithm-selector to data which has sequential characteristics. Although such networks have demonstrated ground-breaking performance on a variety of tasks that include image captioning, language translation and handwriting recognition (Lipton et al. 2015), their applicability has not been exploited within the ASP domain.

3 Online 1D-BPP: definition and heuristics

The objective of the general 1D Bin-Packing Problem (1D-BPP) is to find a packing which minimises the number of containers, b, of fixed capacity c required to accommodate a set of n items with weights \(\omega _j: j \in \{1, \ldots , n\}\) falling in the range \(1 \le \omega _j \le c, \omega _j \in {\mathbb {Z}}^{+}\) whilst enforcing the constraint that the sum of weights in any bin does not exceed the bin capacity c. The lower and upper bounds on b, \((b_l\) and \(b_u)\) respectively, are given by Equation 1. Any heuristic that does not return empty bins will produce, for a given problem instance, p, a solution using \(b_p\) bins where \(b_l \le b_p \le b_u\).

$$\begin{aligned} b_{l} = \left\lceil \frac{1}{c} \sum \limits _{j = 1}^{n} \omega _j \right\rceil ,~ b_{u} = n \end{aligned}$$
(1)

In online bin-packing, items arrive in a stream, one at a time, and must be packed in the order that they arrive. In the specific version that we consider here, all items to be packed are known before packing starts (i.e. they constitute a fixed length batch) but the order that items in the batch are presented to the packing heuristics is fixed and cannot be changed. In contrast to other types of packing problem, the sequence cannot be re-ordered to find an ordering that provides an optimal packing with respect to a given heuristic. The function of the algorithm-selection method is therefore to select a heuristic to apply to pack the entire batch, considering the items in the fixed order given.

There have been numerous studies over the decades that have investigated the performance of simple approximation algorithms for the online variant of the BPP (Johnson et al. 1974; Delorme et al. 2016). We select 4 simple approximation algorithms from the literature specifically designed for this variation of the BPP (Garey and Johnson 1981) in order to evaluate the proposed algorithm-selection methods:

  • First fit (FF) Places each item into the first feasible bin that will accommodate it.

  • Best fit (BF) Places each item into the feasible bin that minimises the residual space.

  • Worst fit (WF) Places each item into the feasible bin with the most available space.

  • Next fit (NF) Places each item into the current bin.

For all the algorithms listed, if no feasible bin is available to accommodate the next item then it is placed into a newly opened bin. NF differs from the other three algorithms in that it only ever considers the most recently opened bin. If an item cannot fit in the current bin, that bin is closed and removed from the problem.
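To make the packing rules concrete, the following minimal Python sketch applies each heuristic to a fixed item sequence. It is an illustration of the definitions above rather than the implementation used in our experiments; the capacity of 150 matches the benchmark datasets and the example sequence is arbitrary.

```python
def pack(items, capacity, rule):
    """Pack a fixed sequence of item weights with one online heuristic.

    rule is one of 'FF', 'BF', 'WF', 'NF'; bins holds the residual capacity
    of each opened bin, in opening order. Returns the number of bins used."""
    bins = []
    for w in items:
        feasible = [i for i, r in enumerate(bins) if r >= w]
        if rule == 'NF':                                   # only the current (last) bin is open
            feasible = [i for i in feasible if i == len(bins) - 1]
        if not feasible:
            bins.append(capacity - w)                      # open a new bin
        elif rule == 'BF':
            i = min(feasible, key=lambda i: bins[i])       # least residual space
            bins[i] -= w
        elif rule == 'WF':
            i = max(feasible, key=lambda i: bins[i])       # most residual space
            bins[i] -= w
        else:                                              # FF (and NF): first feasible bin
            bins[feasible[0]] -= w
    return len(bins)

# The heuristics can disagree on the same fixed sequence (capacity 150):
sequence = [60, 50, 100, 40, 50]
print({rule: pack(sequence, 150, rule) for rule in ('FF', 'BF', 'WF', 'NF')})
# -> {'FF': 2, 'BF': 2, 'WF': 3, 'NF': 3}
```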

The performance of an algorithm A on instance \({\mathcal {I}}\) is denoted by \(A({\mathcal {I}})\). \(OPT({\mathcal {I}})\) is the optimal solution for that instance. The worst-case performance ratio (WCPR) of A is defined as the smallest real number \(r(A) > 1\) such that \(\frac{A({\mathcal {I}})}{OPT ({\mathcal {I}})} \le r(A)\) for all possible instances. The WCPR of NF is known to be 2 (Delorme et al. 2016) and it was recently concluded after many theoretical studies that the WCPR of FF and BF is \(\frac{17}{10}\) (Dósa and Sgall 2014).

4 Problem instances

We use a set of benchmark BPP instances that were first introduced in Alissa et al. (2019). These benchmarks consist of four balanced datasets: each dataset has 4000 instances, and contains exactly 1000 instances uniquely solved best by each of the four heuristics described in Sect. 3. The datasets were generated using an Evolutionary Algorithm which maximises the difference in a function f between the target algorithm and the other algorithms used in the selection problem, where f is Falkenauer’s fitness function (Falkenauer and Delchambre 1992) given in Equation 2; the datasets are described in detail in Alissa et al. (2019).

$$\begin{aligned} Fitness = \frac{1}{b} \sum _{i=1}^{b}\left( {\frac{fill_i}{C}}\right) ^{k} \end{aligned}$$
(2)

Each instance in each dataset is labelled with the heuristic that provides the best result according to Equation 2. This metric is commonly used to gauge the quality of a solution produced by a bin-packing algorithm and returns a value between 0 and 1. In the original datasets described in Alissa et al. (2019), k is fixed at 2, C is the bin capacity which is fixed at 150, \(fill_i\) is the sum of the item sizes in \(bin_i\) and b is the number of bins used.
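As an illustration, the labelling step can be expressed in a few lines of Python. This is a sketch under the stated settings (k = 2, C = 150); the per-bin fill levels are assumed to have been produced already by a packing heuristic, and the function names are ours rather than those of the original implementation.

```python
def falkenauer_fitness(fills, capacity=150, k=2):
    """Equation 2: the mean of (fill_i / C)^k over the b bins used; lies in [0, 1]."""
    return sum((f / capacity) ** k for f in fills) / len(fills)

def label_instance(fills_by_heuristic, capacity=150):
    """Label an instance with the heuristic whose packing maximises Equation 2.

    fills_by_heuristic maps a heuristic name (e.g. 'BF') to the list of bin
    fill levels that heuristic produced for the instance."""
    return max(fills_by_heuristic,
               key=lambda h: falkenauer_fitness(fills_by_heuristic[h], capacity))
```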

A new dataset, identified as DS5, is created by combining instances selected from all four datasets just described. This facilitates an investigation into whether the feature-based and feature-free models generalise across a mixed set of instances of different lengths with item weights drawn from different probability distributions and bounds. DS5 contains 4000 instances, with 1000 instances selected from each of DS1-DS4. From each dataset, 250 instances were selected at random for each class (FF, BF, WF and NF), resulting in a balanced dataset containing equal numbers of instances from each class and each distribution. Table 1 provides a description of each dataset. These datasets are available for other researchers working in the field of ASP to compare approaches (Footnote 3).
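A minimal sketch of this construction is given below, assuming each source dataset is held as a list of (instance, label) pairs; the data layout, function name and random seed are illustrative assumptions.

```python
import random

def build_ds5(datasets, per_class=250, classes=('FF', 'BF', 'WF', 'NF'), seed=0):
    """Combine DS1-DS4 into a balanced mixture: 250 instances per class per dataset."""
    rng = random.Random(seed)
    combined = []
    for instances in datasets.values():        # datasets = {'DS1': [...], ..., 'DS4': [...]}
        for cls in classes:
            pool = [pair for pair in instances if pair[1] == cls]
            combined.extend(rng.sample(pool, per_class))
    rng.shuffle(combined)
    return combined                            # 4 datasets x 4 classes x 250 = 4000 instances
```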

Table 1 Dataset parameters

5 Models: a deep learning model and a set of classical machine learning models for algorithm-selection

We have outlined above that conventional machine learning techniques used for the ASP tend to consider vectors of features as input to classical machine learning models. Candidate features typically describe spatial or statistical characteristics, with little consideration of the ordering or sequential information describing an instance. On the one hand, this ignores potentially valuable sequential information that could improve algorithm-selection, while on the other, it limits the applicability of standard approaches to streaming data where features would need to be calculated dynamically.

We propose a model, applicable to domains in which data has a fixed ordering, that explicitly considers the ordering as input to an algorithm-selection technique and that uses a method borrowed from the deep learning literature. Deep learning has achieved ground-breaking results in applications where the input is formatted as time-series data or in domains where sequences have specific orderings but without any explicit notion of time (Lipton et al. 2015). Examples include video and image recognition, natural language processing, music generation and speech recognition (Graves 2012; Pouyanfar et al. 2018; Skansi 2018).

For comparison purposes, we evaluate conventional machine learning techniques that use feature-based input. Out of curiosity, we additionally evaluate the performance of two conventional machine learning models (Multi Layer Perceptron (MLP) and Random Forest (RF)) trained using the ordered list of item weights as input, i.e. without knowledge-driven feature extraction. The purpose of this is to investigate whether a conventional classifier can learn anything from feature-free input. Both models and their results are described at the end of Sect. 7.1.

5.1 Deep learning feature-free model

We select a Recurrent Neural Network (RNN) as a deep learning method designed to learn from sequence data. RNNs are one of the two most common architectures described under the umbrella term Deep Learning (DL). They differ from feedforward neural networks due to the presence of cyclic connections from each layer’s output to the next layer’s input, with feedback loops returning to the previous layer (Fig. 2a) (Wang and Raj 2017; Lipton et al. 2015). This structure prevents traditional back-propagation from being applied, since there is no end point at which the back-propagation can stop. Instead, Back Propagation Through Time (BPTT) is applied: the RNN structure is unfolded into several neural networks over a certain number of time steps and then traditional back-propagation is applied to each one of them (Fig. 2b) (Wang and Raj 2017). RNNs are specifically designed to learn from sequence data, where the sequential information explicit in the order of sequences is used to identify relationships between the data and the expected outputs from the network.

Fig. 2 (a) A simple RNN and (b) an example of an unfolded RNN with two time steps (Lipton et al. 2015)

In this study we use specialised RNNs known as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) and Gated Recurrent Unit (GRU) (Cho et al. 2014) networks, which have been shown to be efficient and effective in learning long-term dependencies from sequences of ordered data. LSTM neural networks incorporate additional gates that can retain, retrieve and forget information (i.e. the network states) over long periods of time (Graves 2012). These gates are simply a combination of addition, multiplication and non-linear functions (Nielsen 2015). The LSTM uses three main gates (input, output and forget) and three main states (input, hidden and internal cell states). Basically, the internal state is the “memory” of the LSTM block, the hidden state represents values that come from the previous time step, and the input state is the result of a linear combination of the hidden state and the input at the current time step. In the LSTM network, the classic neurons in the hidden layer are replaced by memory blocks. Figure 3a shows that the LSTM’s input comes from the network through the input gate and the only outputs from the LSTM to the rest of the network emanate from the output gate multiplication. The input gate determines how much of the new memory content is added to the memory cell, the output gate modulates how much of the internal state is exposed to the external network (higher layers and the next time step), while the forget gate defines how much of the existing memory is forgotten. The GRU is a more recent variation on the LSTM with only two gates, the update and reset gates, which decide what information should be passed to the output. The update gate decides how much of the past information is passed to the future, while the reset gate determines how much is forgotten. Also, the GRU does not use an internal state and instead uses the hidden state to transfer information through the time steps (Chung et al. 2014). A more comprehensive description of the rapidly expanding field of DL, which has many competing, but no prevalent, architectures is outwith the scope of this study.
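For reference, the GRU update can be summarised in equation form as below (bias terms omitted), following the formulation in Chung et al. (2014); here \(z_t\) and \(r_t\) are the update and reset gates, \({\tilde{h}}_t\) is the new (candidate) hidden state and \(\odot \) denotes element-wise multiplication:

$$\begin{aligned} z_t&= \sigma \left( W_z x_t + U_z h_{t-1}\right) , \qquad r_t = \sigma \left( W_r x_t + U_r h_{t-1}\right) ,\\ {\tilde{h}}_t&= \tanh \left( W x_t + U \left( r_t \odot h_{t-1}\right) \right) , \qquad h_t = \left( 1 - z_t\right) \odot h_{t-1} + z_t \odot {\tilde{h}}_t. \end{aligned}$$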

Fig. 3 Graphical illustration of (a) an LSTM, where i, f and o are the input, forget and output gates, respectively, and c and c\(^{\sim }\) denote the memory cell and the new memory cell content, and (b) a Gated Recurrent Unit, where r and z are the reset and update gates, and h and h\(^{\sim }\) are the hidden state and the new hidden state, respectively (Chung et al. 2014)

5.2 Feature-based machine learning models

We evaluate six well-known classical machine learning techniques (Shalev-Shwartz and Ben-David 2014) from the literature: Neural Network (NN); Decision Trees (DT); Random Forest (RF); Naive Bayes (NB); k-Nearest Neighbours (KNN) and Support Vector Machine (SVM). Each is used in its standard form, parameterised using grid-search and Bayesian optimisation as described in Sect. 6. The success of ASP methods that rely on feature extraction depends critically upon the chosen features (Smith-Miles and Lopes 2012) and on the classification method used to learn the correlation between the features and the best-performing heuristic.

5.3 Definition of features

The traditional method of dealing with ASP is to hand-design features and then extract them from the instance data. Here, we use a set of features collated from multiple papers in the literature where algorithm-selection methods have been applied to bin-packing (Cruz-Reyes et al. 2012; López-Camacho et al. 2013; Brownlee et al. 2018), discussed in Sect. 2. The features are given below:

  1. Mean item size divided by bin capacity C (Brownlee et al. 2018; López-Camacho et al. 2013; Cruz-Reyes et al. 2012; Mao et al. 2017);

  2. Standard Deviation (std) in the item sizes divided by bin capacity C (Brownlee et al. 2018; López-Camacho et al. 2013; Cruz-Reyes et al. 2012; Mao et al. 2017);

  3. Maximum item size divided by bin capacity C (Brownlee et al. 2018; Mao et al. 2017);

  4. Minimum item size divided by bin capacity C (Brownlee et al. 2018; Mao et al. 2017);

  5. Median item size divided by bin capacity C (Brownlee et al. 2018);

  6. Maximum item size divided by minimum item size (Brownlee et al. 2018);

  7. The ratio of small items of size \(\omega \le C/4\) (Ross et al. 2002);

  8. The ratio of medium items of size \(C/4 <\omega \le C/3\) (Ross et al. 2002);

  9. The ratio of large items of size \(C/3 <\omega \le C/2\) (Ross et al. 2002);

  10. The ratio of huge items of size \(\omega > C/2\) (Ross et al. 2002).

It should be clear that these features do not provide any information about the sequential characteristics implicit in the data-stream that describes each instance (i.e. the order of items to be packed). Rather, they describe statistical properties relating to the distribution of the item sizes. The 10 features were extracted for each of 16,000 instances contained in DS(1-5) and used as input to the classical machine learning models.
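For illustration, the ten features can be computed for a single instance as in the sketch below. This is our own sketch: the population standard deviation and simple proportion counts are assumptions where the cited papers do not pin down the exact convention.

```python
import statistics

def extract_features(items, capacity=150):
    """The 10 statistical features of Sect. 5.3 for one instance (ordering is ignored)."""
    n = len(items)
    return [
        statistics.mean(items) / capacity,                          # 1. mean / C
        statistics.pstdev(items) / capacity,                        # 2. std / C
        max(items) / capacity,                                      # 3. max / C
        min(items) / capacity,                                      # 4. min / C
        statistics.median(items) / capacity,                        # 5. median / C
        max(items) / min(items),                                    # 6. max / min
        sum(w <= capacity / 4 for w in items) / n,                  # 7. small:  w <= C/4
        sum(capacity / 4 < w <= capacity / 3 for w in items) / n,   # 8. medium: C/4 < w <= C/3
        sum(capacity / 3 < w <= capacity / 2 for w in items) / n,   # 9. large:  C/3 < w <= C/2
        sum(w > capacity / 2 for w in items) / n,                   # 10. huge:  w > C/2
    ]
```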

6 Methodology

We use both the Keras (Footnote 4) and Sklearn (Footnote 5) libraries to implement the models used in the DL and classical ML experiments. We use a Keras implementation of the LSTM and GRU in “sequence-to-one” mode, where the input is an ordered list of item weights and the output is a “one-hot” encoding using 4 bits to identify the best heuristic (1000 = BF, 0100 = FF, 0010 = NF, 0001 = WF). The Sklearn library is used to implement the classical ML models, with the exception of the NN model, which is implemented in Keras with a one-hot encoded output. All the classical models take the 10 features defined in Sect. 5.3 as input. Experiments are conducted on Google Colab (Footnote 6), with the Tensor Processing Unit (TPU) run-time used to execute the experiments. A preliminary empirical investigation was conducted to tune both the LSTM and NN architectures and hyper-parameters. The “Adam” optimiser (Kingma and Ba 2014) was selected due to its reported accuracy, speed and low memory requirements.
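A minimal sketch of the sequence-to-one recurrent selector in Keras is shown below. The hidden-layer size, zero-padding/masking and other details are illustrative assumptions rather than the tuned configuration, which is listed in Tables 11 and 12 in the appendix.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_selector(unit_type='lstm', hidden_units=64):
    """Sequence-to-one RNN: an ordered list of item weights in, a softmax over
    the four heuristics (BF, FF, NF, WF) out, matching the one-hot encoding."""
    rnn = layers.LSTM if unit_type == 'lstm' else layers.GRU
    model = keras.Sequential([
        keras.Input(shape=(None, 1)),          # variable-length sequence of item weights
        layers.Masking(mask_value=0.0),        # assumed zero-padding when batching sequences
        rnn(hidden_units),                     # only the final state is kept (sequence-to-one)
        layers.Dense(4, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Usage: X has shape (instances, max_length, 1), y is one-hot with shape (instances, 4).
# model = build_selector('gru'); model.fit(X, y, epochs=300, validation_split=0.1)
```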

Due to time limitations and the large number of hyper-parameters for the LSTM/GRU and NN, we undertook preliminary investigations to optimise the LSTM (Footnote 7) and NN approaches. We used 300 learning iterations for the LSTM and GRU experiments on DS1 and DS2 and 700 learning iterations for DS3, DS4 and DS5, since the longer instances were found to require more learning iterations. Both grid search and Bayesian optimisation were used to choose the best set of hyper-parameters for each of the remaining traditional machine learning techniques. Tables 11 and 12 in the appendix show the range of hyper-parameters evaluated and the final selected parameters for each of the LSTM, GRU, NN and ML models.
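As an illustration of the tuning step for the classical models, a minimal grid-search sketch using scikit-learn is given below; the parameter grid shown is purely illustrative, and the grids and Bayesian search actually evaluated are those reported in the appendix.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; X_features holds the 10 features of Sect. 5.3,
# y_labels the best-heuristic class for each training instance.
param_grid = {'n_estimators': [100, 300, 500], 'max_depth': [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=10, scoring='accuracy', n_jobs=-1)
# search.fit(X_features, y_labels)
# best_rf = search.best_estimator_
```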

We conduct independent experiments using each of the five datasets to train and test the LSTM, GRU and classical ML models to predict the best heuristic for each instance. Each dataset was split into training (80%) and test (20%) sets while maintaining the balance in size of each target class (each test set has 800 instances, of which 200 are solved best by each heuristic). Each model was trained using 10-fold cross validation, using 10% of the training instances to verify each fold and to evaluate the generalisation error. The models are then retrained on the whole corresponding training set and tested on the test set comprising 800 unseen instances. Each trained model is evaluated on an unseen test set in terms of the following four aspects, which are also described in detail in Sect. 7 (a sketch of the split and cross-validation setup is given after the list):

  1. Comparison of algorithm classification accuracy, i.e. the percentage of instances classified correctly.

  2. Comparison of the selectors’ performance to the SBS and VBS using Falkenauer’s performance metric (Equation 2).

  3. Comparison of the selectors’ solution quality using the number of bins \(b'\) used by the predicted heuristic.

  4. Evaluation of the ability to generalise, based on the above criteria, over a wide range of different randomly generated datasets.
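The sketch below illustrates the stratified 80/20 split and the 10-fold cross-validation described above, assuming X holds the model inputs (sequences or feature vectors) and y the best-heuristic labels; variable names are ours.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# 80/20 split that preserves the class balance (200 test instances per heuristic).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 10-fold cross-validation: each fold uses 10% of the training data for validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X_train, y_train):
    pass  # fit on X_train[train_idx], validate on X_train[val_idx]
```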

In addition to evaluating accuracy, we also report the total number of bins in a solution, summed over all instances in a dataset, when the heuristic returned by the model for each instance is used to create the solution. As the number of bins per instance in a given dataset can vary widely, to avoid issues that occur when summing data that has different scales, we normalise the total number of bins according to equation 3 and sum the normalised values.

$$\begin{aligned} Percentage \,\,Of\,\, Bins =\frac{b' - b}{b}, \text{ where } b = \left\lceil \sum _{j=1}^{n}\left( {\frac{\omega _j}{C}}\right) \right\rceil , \text{ and } b \le b' \le 2b \end{aligned}$$
(3)

As mentioned in Sect. 3, since the worst-case performance ratio for the heuristics considered is bounded by double the lower bound, Equation 3 returns a value between 0 and 1. A Wilcoxon signed-rank test is used to evaluate significance in a pairwise fashion for all comparisons of Falkenauer’s performance and bins. This statistical test is corrected for multiple comparisons with the Bonferroni method (Weisstein 2004), i.e. the p-values have been multiplied by the number of comparisons and then compared against the confidence level of 5%. Statistical testing is conducted to evaluate three hypotheses (a sketch of the testing procedure is given after the list):

  • \({\mathcal {H}}_0(1)\): the LSTM/GRU and best ML method produce equal results with respect to a) Falkenauer’s fitness metric b) total bins utilised.

  • \({\mathcal {H}}_0(2)\): the LSTM/GRU and SBS produce equal results with respect to a) Falkenauer’s fitness metric b) total bins utilised.

  • \({\mathcal {H}}_0(3)\): the best ML method and SBS produce equal results with respect to a) Falkenauer’s fitness metric b) total bins utilised.
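A minimal sketch of the testing procedure follows, assuming paired per-instance scores (Falkenauer fitness or normalised bins) for a selector and a baseline such as the SBS; the helper name and interface are ours.

```python
from scipy.stats import wilcoxon

def bonferroni_wilcoxon(selector_scores, baseline_scores, n_comparisons, alpha=0.05):
    """Paired Wilcoxon signed-rank test with Bonferroni correction.

    The raw p-value is multiplied by the number of comparisons and then
    compared against the 5% confidence level, as described above."""
    _, p_value = wilcoxon(selector_scores, baseline_scores)
    p_corrected = min(1.0, p_value * n_comparisons)
    return p_corrected, p_corrected < alpha
```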

7 Results

This section discusses in detail the four aspects we use to evaluate each trained model on an unseen test set.

7.1 Accuracy of algorithm-selection

As mentioned in Sect. 6, 10-fold cross validation is used to train the models, taking 10% of the training instances to verify each fold and to evaluate the generalisation error. Table 13 in the appendix shows results achieved on the validation sets used during training for the DL models (LSTM and GRU) and all 6 ML models. Table 2 shows the results achieved on each test set with the model obtained from training on the full training set. We report the classification accuracy as an indicator of the LSTM’s, GRU’s and ML models’ predictive abilities. Using the Wilcoxon signed-rank test, significance is calculated in a pairwise fashion and corrected for multiple comparisons with the Bonferroni method between the DL techniques and the VBS, SBS and best of the ML techniques for each dataset, and between the best ML technique and the SBS. p-values are shown in Table 14 in the appendix. Table 3 additionally shows the comparison between the LSTM, GRU, the best ML technique and the SBS over the different test sets.

Table 2 Classification accuracy of the trained model in each experiment from the test set, values in italic indicate the best ML results and values in bold the best overall result per dataset
Table 3 The comparison between the LSTM, the GRU, the best ML technique and the SBS over the different test sets in terms of Falkenauer’s performance using the paired Wilcoxon signed-rank test with 5% confidence level corrected for multiple comparisons with the Bonferroni method

Results obtained from the deep learning LSTM and GRU approaches are significantly better than the results using classical ML techniques (i.e. \({\mathcal {H}}_0(1)\) is rejected). We observe that the DL results from the validation and test sets on DS2 and DS4 are better than on the other datasets, suggesting that the sequential correlations are easier to find in the datasets comprised of item weights generated from a wider range of values following a uniform distribution. The instances evolved with items in the range [40, 60] may have fewer distinct patterns (i.e. less sequential information) than instances generated with items in the range [20, 100], regardless of the length of the instances. It is well known that problems with an average weight of \(\frac{C}{3}\) are more difficult to solve (Falkenauer and Delchambre 1992), and it is interesting that problems with those characteristics are also more difficult to classify using the LSTM or GRU, i.e. instances from DS(1,3) with item weights generated from the narrow range of values [40, 60], whose average is one third of the bin capacity of 150. Although all the ML results show relatively poor performance compared to the DL methods, the ML techniques do comparatively better, in most cases, on the two longer datasets DS3 and DS4 than on the shorter DS1 and DS2 on the validation set; on the test set this is true only for the comparison between DS2 and DS4. It might be that the longer instances result in more distinct values for the extracted features, hence increasing the ability of the ML techniques to classify correctly.

Although training the LSTM and GRU on the longer instances requires significantly more learning iterations before the models converge, it is interesting to note that for both distributions the results obtained by the LSTM and GRU on the longer instances (e.g. DS4) exceed those reported on the datasets with smaller numbers of items (e.g. DS2) on the validation set. We conjecture that the longer instances provide the DL models with more sequential information, hence increasing their ability to determine patterns in the item sequences. The ML results partially concur with the LSTM and GRU results in this respect, i.e. results on DS3 are more accurate than on DS1 (apart from NB) and those for DS4 are more accurate than for DS2 on the validation set. For both DL and ML models, this is true only for the comparison between DS2 and DS4 on the test set. The LSTM and GRU experiments conducted on the combined DS5 set show accuracy intermediate between that achieved on the instances with the uniform distribution and those with the Gaussian distribution. These models are able to generalise over instances sampled from all of the problem lengths and the different weight distributions investigated without any apparent loss of precision. In contrast, the ML experiments conducted on the combined DS5 set show worse results than most of the other experiments. This demonstrates that the feature-based approach is weak in its ability to generalise over instances with different characteristics and highlights its reliance on the quality of the designed features, in contrast to the LSTM and GRU approaches.

Table 4 presents confusion matrices of the LSTM, GRU and best ML techniques extracted from the experiments conducted on the test set of DS5; the remaining confusion matrices are shown in Table 15 in the appendix. In most cases, the DL models were most frequently confused when attempting to classify the sequences identified as being solved best by FF and BF. It is interesting to note that Alissa et al. (2019) previously showed that these two algorithms are extremely close in terms of their two largest principal components in a space defined by a 4-d vector containing the performance metric of each of the 4 heuristics for each instance. Similarly, instances labelled as NF appear to be the easiest to identify and correspondingly are the most isolated in the performance-space. As noted previously in Alissa et al. (2019), it appears that the patterns shown by conducting a PCA of the performance space are correlated with the ability of the LSTM to identify the best performing algorithm from the raw instance sequences. The ML models are also frequently confused between FF and BF (similarly to the LSTM and GRU), but are additionally confused between FF and WF. In terms of the accuracy of classifying NF, the ML results partly concur with the DL results, in that NF instances are the easiest to classify only in DS1 and DS3.

Table 4 The confusion matrix of the LSTM, GRU and Best ML models in experiments on DS5 from the test set

Purely out of interest, we conducted a 10-fold cross validation on DS4 using only the item-size information defining the original instances (i.e. without any features), directly supplying all item sizes at once to two of the classical ML techniques: the Multi Layer Perceptron (MLP) and the Random Forest (RF) classifiers. The ML techniques were used directly from Weka (Eibe et al. 2016) without altering any of the default parameters. The RF achieved 67.55% accuracy; the MLP was equally successful, achieving 67.08% accuracy. Although these results are better than expected, a DL approach that has been designed to work with sequential data clearly provides better results.

Fig. 4 Cumulative distribution plots over the test sets of 800 instances of each DS(1–5), evaluating the LSTM and GRU predictors vs the classical ML techniques, SBS and VBS on the performance space using Falkenauer’s fitness metric

7.2 Comparison to SBS and VBS

In terms of Falkenauer’s fitness, for all experiments the DL selectors significantly improve on both the SBS (BF for all datasets) and the classical ML techniques (i.e. \({\mathcal {H}}_0(1,2)\) are rejected), and are very close to the VBS. The best ML techniques also improve significantly over the SBS (i.e. \({\mathcal {H}}_0(3)\) is rejected); p-values are shown in Table 14 in the appendix.

Figure 4 shows cumulative distribution plots over the test sets of 800 instances from each DS(1-5) in terms of Falkenauer’s fitness: the plots show the percentage of instances that are solved within a distance d\(_{p}\) of the oracle-like VBS (the perfect mapping), given the solver predicted by a model m. The distance is calculated as d\(_{p}\) = (VBS\(_{Falkenauer's fitness}\) − Selector\(_{Falkenauer's fitness}\)). Hence, d\(_{p}\) \(\ge \) 0, and d\(_{p}\) = 0 is optimal.
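For clarity, the quantity plotted can be computed as in the following sketch, assuming paired per-instance Falkenauer fitness values for the VBS and a given selector:

```python
def fraction_within(vbs_fitness, selector_fitness, d=0.05):
    """Fraction of test instances whose predicted heuristic lies within a
    Falkenauer-fitness distance d of the VBS (d_p = VBS - selector >= 0)."""
    gaps = [v - s for v, s in zip(vbs_fitness, selector_fitness)]
    return sum(g <= d for g in gaps) / len(gaps)
```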

Results show that the RNN-LSTM/GRU solves between 80.88 and 97.63% of the instances within 5% of the VBS performance, compared to 50–62.2% using the SBS. This represents a 30.88–35.43% improvement on DS(1-5). The best ML technique solves 67.13–78.63% of instances within a 5% performance difference across the datasets, which is a 16.43–17.13% improvement over the SBS. Using either the LSTM or GRU, between 99.63 and 100% of the instances can be solved within 30% of the VBS for DS(1-5): in contrast, using the best ML technique, this is only true on DS(1,3). From the figure, it is clear that the best ML technique can only solve 100% of instances if we consider a performance difference of 40% on DS(2,4,5). It is interesting to note that although the SBS solves a lower percentage of instances than both the DL and the best ML technique if we consider a difference of 5% of the VBS performance, it manages to solve all the instances within 20% of the VBS in all the datasets. Although the ML techniques improve over the SBS in most cases, they are not as good as the LSTM/GRU and are not as close to the VBS as the DL methods.

It is noticeable from the evolved instances in Alissa et al. (2019) that the performance of the heuristics is skewed towards FF and BF, i.e. these heuristics are either the best or second best choice for all instances. This means that even if the selector mis-classifies most of the NF and WF instances by choosing FF or BF, it will still achieve high performance. Therefore, as well as comparing the two approaches based on the performance metric and classification accuracy, we additionally compare them in terms of the number of bins used to pack a set of items to get a clearer understanding of their relative performance.

7.3 Evaluation of solution quality

The overall objective of the BPP is to minimise the number of bins used to pack a set of items. Ultimately, the number of containers defines the cost of any real-world solution. Table 5 shows the number of bins required to pack all 800 test instances for each DS(1-5) and contrasts this against the lowest possible number of bins used by the VBS and the number of bins needed using the algorithms predicted by the LSTM, GRU and traditional ML selectors. The LSTM and GRU use between 1.32 and 2.39% fewer bins than the SBS and between 0.21 and 1.52% more than the VBS. On DS4, the GRU uses over 1150 bins fewer than the SBS and only 176 (0.21%) more than the VBS, which uses 83,823 bins. The ML techniques, in contrast, use between 1.85% more and 0.92% fewer bins than the SBS, and 1.87–4.74% more than the VBS.

After normalising the number of bins \(b'\) used by the predicted algorithm (described in Sect. 6), a Wilcoxon signed-rank test is used to evaluate significance in a pairwise fashion for all comparisons of bins, corrected for multiple comparisons with the Bonferroni method. Table 14 in the appendix shows the p-values of \(b'\) compared in a pairwise fashion to the bins used by the SBS and the VBS. Based on these p-values, Table 6 shows the comparison between the LSTM, GRU, the best ML technique and the SBS over the different test sets in terms of bins. For all experiments the DL selectors significantly improve on both the SBS (BF for all datasets) and the classical ML techniques (i.e. \({\mathcal {H}}_0(1,2)\) are rejected), and are very close to the VBS. The best ML techniques, however, improve significantly over the SBS (i.e. \({\mathcal {H}}_0(3)\) is rejected) only on DS(1,3).

Figure 5 shows the cumulative distribution plots over the test sets of 800 instances from each DS(1-5) in terms of the number of bins, where the difference d\(_{b}\) = (Selector\(_{used bins}\) − VBS\(_{used bins}\)) \(\ge \) 0 and d\(_{b}\) = 0 is the best. In terms of the percentage of instances that are solved by each method, the LSTM/GRU solves 82.13–98.75% within 5% of the VBS, while the best ML technique solves 69.5–84.63% and the SBS 53–89% for DS(1-5). For larger differences from the VBS, the results depend on the dataset used. On DS(1,3), 99.13–100% of the instances are solved by the LSTM/GRU, compared to 93.25–93.38% by the ML techniques, within 10% of the VBS, and both techniques solve all the instances within 20% of the VBS. On DS(2,4,5), 99.88–100% of instances are solved by the LSTM/GRU and 98.13–100% by the best ML technique within 30% of the VBS. The SBS manages to solve 99.6–100% of the instances within 10% of the VBS on DS(1,3) and within 20% of the VBS on DS(2,4,5).

Fig. 5 Cumulative distribution plots over the test sets of 800 instances of each DS(1–5), evaluating the LSTM and GRU predictors vs the classical ML techniques, SBS and VBS on the performance space using the number-of-bins metric

Table 5 Total bins required to pack instances in the test set for the LSTM predictor, GRU predictor, classical ML techniques, traditional heuristics and VBS, values in italic indicate the best ML results and values in bold the best overall result per dataset and BF is the SBS over all the datasets
Table 6 The comparison between the LSTM, the GRU, the best ML technique and the SBS over the different test sets in terms of number of bins using the paired Wilcoxon signed-rank test with 5% confidence level corrected for multiple comparisons with the Bonferroni method

In summary, the results presented indicate that a deep learning method is clearly superior to using a classical prediction method trained with extracted features. Furthermore, as shown in the last column in Tables 3 and 6, we infer that the choice of deep method itself has minimal significant effect, i.e. it is the switch to a learning method that captures sequential information that provides the gain in performance.

8 A systematic analysis across multiple datasets

In the previous section we compared the novel feature-free and the traditional feature-based approaches to the ASP using datasets in which the instances were known to be discriminatory with respect to the four heuristics used as solvers. We suggest that the reason that each heuristic favours one subset of instances over another is that each heuristic is able to exploit some implicit structure within the instances in a subset. Thus a model that is able to detect this structure within the instance data can successfully act as an algorithm-selector. In contrast, we suggest that algorithm-selection techniques are likely to perform poorly on instances that do not exhibit exploitable structure.

To illustrate this concept, we generate four new datasets: in each, 1000 new instances are generated by selecting item-sizes at random from the distributions defined in Table 1. Table 7 shows how many of the randomly generated instances are best solved by each of the 4 heuristics. It is immediately clear that (a) the BF heuristic wins the majority of instances in all datasets (73% of the instances are best solved by BF); (b) the FF heuristic is the second best heuristic, winning approximately all of the remaining instances; (c) the WF heuristic only wins a few instances in two datasets, RDS(1,3); and (d) the NF heuristic fails to win a single instance in any dataset. The table also highlights that on the same datasets, the BF and FF heuristics tie as winners on large numbers of instances. Furthermore, even in cases where one heuristic outperforms another, there is very little difference in the performance metric: Fig. 6 plots the Falkenauer fitness obtained by the two dominant heuristics (BF, FF) on each instance as a scatter-plot. It is immediately clear that most random instances lie on or very close to the diagonal. We suggest therefore that these randomly generated instances contain no exploitable structure and hence there is little benefit to be gained from algorithm-selection. Figure 6b, however, clearly shows that the original evolved instances benefit from a selection method.

However, the combinatorial optimisation literature suggests that for many domains, real-world instances of problems commonly exhibit structure that is not captured by uniform generation of random problems. Examples of this are described in the TSP domain (Hains et al. 2011, 2012) and SAT domain (Kroc et al. 2009; Qasem and Prügel-Bennett 2009), where the authors show that it is important to tailor optimisation methods towards the structured instances in order to obtain good performance rather than relying on ‘generic’ methods. Thus, if algorithm-selection methods are likely to prove beneficial on real-world datasets, this raises the question ‘how much structure is required?’. In order to shed further light on the relationship between instance structure and algorithm-selection, and on the characteristics of datasets on which our proposed technique is likely to work, we conduct a systematic analysis over multiple datasets containing varying levels of structure, as described in the next section.

Table 7 Number of instances best solved by each heuristic for RDS(1–4)
Fig. 6 Scatter plot of Falkenauer performance for (a) 1000 randomly generated instances and (b) 800 evolved instances with 0.1 \(\le \) Threshold \(\le \) 0.22, from the distribution that defines dataset DS2

8.1 Generating increasingly structured instances

Based on the analysis above, we consider only BF and FF as potential heuristic solvers. We generate datasets that are increasingly discriminatory with respect to these heuristics, using the magnitude of the performance gap between the two heuristics applied to the same instance as a proxy for quantifying the implicit structure within the instance that is exploited by one heuristic or the other. Specifically, we generate instances at random from a given distribution, measure the Falkenauer fitness \(BF_i\), \(FF_i\) on each, then discard instances where \(|BF_i-FF_i|< \tau \), for \(0 \le \tau \le 0.06\). Thus, higher values of \(\tau \) should lead to more structured instances. 1750 instances are generated in this manner from the distributions in Table 1 corresponding to DS1 and DS2. In each dataset, we accumulate 1000 instances best solved by BF and 750 best solved by FF, to reflect the performance distribution as described in the next section.
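A minimal sketch of this rejection-sampling procedure is given below. Here bf_fitness and ff_fitness are hypothetical helpers that pack the fixed sequence with the respective heuristic and return Equation 2, and the uniform item-size range is illustrative (the actual distributions are those of DS1 and DS2 in Table 1).

```python
import random

def sample_structured_instance(n_items, low, high, tau, capacity=150, max_tries=100_000):
    """Rejection-sample a random instance whose BF/FF performance gap is at least tau."""
    for _ in range(max_tries):
        items = [random.randint(low, high) for _ in range(n_items)]
        gap = bf_fitness(items, capacity) - ff_fitness(items, capacity)   # hypothetical helpers
        if abs(gap) >= tau:
            return items, ('BF' if gap > 0 else 'FF')   # instance and its winning heuristic
    raise RuntimeError('no sufficiently structured instance found')
```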

8.2 Training and evaluation

From each dataset of 1750 instances, we create a balanced dataset for training but evaluate performance on a test set whose composition reflects the distribution of instances. The training set thus contains 500 instances best solved by each of BF and FF, while the test set includes 750 instances, 500 best solved by BF and 250 by FF, to reflect the imbalanced performance distribution. As in the previous experiments, 10-fold cross validation is used to train the GRU, NN and RF models, taking 10% of the training instances to verify each fold and to evaluate the generalisation error. Table 16 in the appendix shows results achieved on the validation sets used during training for the GRU, NN and RF models on RDS\(_{\tau }\)(1,2). Table 8 shows the results achieved on each threshold test set with the models obtained from training on the full training set. We report the classification accuracy as an indicator of the GRU’s, NN’s and RF’s predictive abilities. For RDS\(_{\tau }\)1, where item-sizes are drawn from a Gaussian distribution, we do not observe any obvious trend as the ‘structure’ in the instances increases, for either the feature-free or the feature-based approaches. As mentioned in Sect. 7.1, the instances generated using this range of values [40, 60] and a Gaussian distribution are more difficult to classify, even for the evolved instances (i.e. those with the maximum ‘structure’ threshold). In contrast, in RDS\(_{\tau }\)2, where the item-sizes are drawn from a uniform-random distribution, there is a general trend of the feature-free approach’s performance increasing as \(\tau \) increases, with a high level of classification accuracy (> 75%) exhibited when \(\tau \ge 0.05\). This trend is much less obvious with the feature-based approaches, whose classification accuracy is at best 67.87%. Figure 7 shows scatter plots of Falkenauer’s fitness for a range of different thresholds for the test set of RDS\(_{\tau }\)2 (500 BF and 250 FF).

Table 8 Classification accuracy of the GRU, NN and RF models in each experiment from the test set for RDS\(_{\tau }\)(1,2) per threshold
Fig. 7 Scatter plots of Falkenauer fitness over a range of thresholds for the RDS\(_{\tau }\)2 test set (500 BF and 250 FF)

Table 9 shows the number of bins required to pack all 750 test instances for each RDS\(_{\tau }\)(1,2) per threshold, and contrasts this against the number of bins used by the SBS and the number of bins needed using the algorithms predicted by each of the GRU, NN and RF selectors. In general, the feature-free approach wastes fewer and saves more bins than the feature-based approaches. On RDS\(_{\tau }\)1, the GRU uses between 0.18 and 0.72% more bins than the SBS, the NN between 0.14 and 1.00%, and the RF between 0.20 and 0.76%. Although both approaches fail to save bins with thresholds less than 0.04 on RDS\(_{\tau }\)2, the GRU uses 0.33% and 0.43% fewer bins than the SBS for thresholds 0.06 and 0.05 respectively, while the NN fails to save any bins and the RF saves only 0.04% at a threshold of 0.05. On RDS\(_{\tau }\)2 with threshold 0.05, the GRU uses 170 fewer bins than the SBS while the RF saves only 16 bins. Using the Wilcoxon signed-rank test, significance is calculated in a pairwise fashion between the GRU, the ML techniques and the SBS for each threshold of dataset RDS\(_{\tau }\)(1,2), corrected for multiple comparisons with the Bonferroni method. p-values are shown in Table 17 in the appendix. Table 10 additionally shows the comparison between the GRU, the ML techniques and the SBS over the different threshold test sets.
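The statistical comparison can be reproduced along the following lines with SciPy; this sketch assumes that per-instance bin counts for each selector and the SBS are available as equal-length lists, and that no pair of methods produces identical counts on every instance (in which case the test is undefined).

```python
from scipy.stats import wilcoxon

def pairwise_wilcoxon_bonferroni(bins_by_method, alpha=0.05):
    """Pairwise Wilcoxon signed-rank tests on per-instance bin counts,
    with Bonferroni correction for the number of comparisons."""
    names = list(bins_by_method)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    results = {}
    for a, b in pairs:
        _, p = wilcoxon(bins_by_method[a], bins_by_method[b])
        p_adj = min(1.0, p * len(pairs))           # Bonferroni-adjusted p-value
        results[(a, b)] = (p_adj, p_adj < alpha)   # (adjusted p, significant?)
    return results

# e.g. pairwise_wilcoxon_bonferroni({'GRU': gru_bins, 'NN': nn_bins,
#                                    'RF': rf_bins, 'SBS': sbs_bins})
```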

Table 9 Total bins required to pack instances in the test set for the GRU, NN and RF predictors and the SBS (BF) for RDS\(_{\tau }\)(1,2) per threshold
Table 10 Comparison between the feature-free approach using the GRU, the feature-based approach using ML techniques, and the SBS over datasets RDS\(_{\tau }\)(1,2) per threshold, in terms of performance and number of bins, using the paired Wilcoxon signed-rank test at the 5% significance level corrected for multiple comparisons with the Bonferroni method

8.3 Analysis and discussion

As widely recognised in the literature, developing an understanding of the region(s) of an instance-space in which an algorithm performs well (Smith-Miles et al. 2010) is essential for performing algorithm-selection. This enables training datasets to be collected that include representative instances from the regions of strength of each algorithm considered. The task for the algorithm-selection designer is then to find a mapping between instances and algorithms: either by extracting informative features from the dataset (‘feature-based’) or by using the raw instance data as input (‘feature-free’).

In Fig. 6 we demonstrated that two of the best performing heuristics for bin-packing (FF and BF) perform very similarly on a large set of randomly generated instances, i.e. the instances are not discriminatory with respect to the heuristics we consider: this is in line with results previously reported in multiple domains (Cho et al. 2008; Smith-Miles and van Hemert 2011) concerning randomly generated benchmarks. This motivated our decision to evolve instances that specifically maximised the performance-gap between heuristics (Fig. 6b). We hypothesised that the evolved instances contain some implicit structure that is exploited by a particular heuristic, enabling it to perform well. Furthermore, we suggested that increasing levels of discrimination between algorithms on an instance correspond to increasing levels of ‘structure’ within the instance-data. We therefore created fourteen further datasets in which we control the level of structure in randomly generated instances, using the performance-gap between two heuristics as a proxy for structure. We then evaluated the performance of both feature-free and feature-based approaches on these datasets.

A systematic analysis showed that in the case of instances generated uniformly at random, increasing the structure dramatically improved the performance of the GRU model in terms of classification accuracy, while this is less obvious for the feature-based NN and RF models. However, there is no discernible trend for instances generated from a Gaussian distribution. One explanation is that the combined effect of drawing from a Gaussian distribution and restricting the item-sizes to a narrow range (40–60 in this case) results in instances which are very similar, i.e. in which it is difficult to find unique structure. The number of ties observed between BF and FF on this dataset (see Table 7) lends some weight to this. Alternatively, a different method of creating structure could be considered, for example, in relation to one or more features of the datasets described in Sect. 5.3.

The results reported in Sect. 7 on DS1-4 and in Table 8 suggest that the proposed GRU model, which uses only the sequence of item-sizes as input, performs well on datasets in which some structure exists within the data (\(\tau \ge 0.05\)). Given that it is well known that real-world instances are structured, this adds weight to the case for developing good algorithm selectors for structured instances, and for using structured datasets as benchmarks both to compare selection methods and to develop new ones. It also suggests that there is likely to be benefit in focusing attention on developing new instances that reflect all areas of the instance-space in which an algorithm performs well, and specifically on filling the gaps in an instance-space where we do not currently have data, in order to improve selection methods. For feature-based approaches, once such data is generated, the additional task of developing informative features from it still remains.

9 Conclusion

We have described a novel approach to algorithm-selection for sequential optimisation problems that exhibit an ordering with respect to the elements of the problem and how they should be dealt with. Unlike most ASP techniques, the approach does not require the design and selection of features to describe an instance. Two deep neural networks (RNN-LSTM and RNN-GRU) were trained using the sequence of items representing an instance directly as input to predict the best algorithm to solve the instance. We compared this feature-free approach with traditional feature-based approaches using ten hand-designed features and six classical ML techniques. Both the novel and the traditional approaches were thoroughly evaluated on five large datasets, exhibiting different numbers of items and different distributions of item-sizes. All classifiers were trained using a large database of instances in which each instance has a distinct best solver, previously described in Alissa et al. (2019).

The accuracy of the LSTM and GRU models ranges from 80 to 96%, while that of the ML models ranges from 41 to 72%, depending on the dataset used. In terms of the percentage of instances solved by the DL predictors within a small difference (\(\le \) 5%) of the VBS Falkenauer performance, we show between 30 and 35% improvement over the Single Best Solver (SBS), depending on the dataset used. In contrast, the best ML predictor achieves between 16 and 17% improvement over the SBS. As far as we are aware, this is the first time that such an approach has been used, and it represents a significant step forward in algorithm-selection for problems with sequence information, where the difficulties associated with defining suitable features and selecting from large sets of potential features are well understood.

We suggested that the method is able to perform well on datasets that exhibit structure—a characteristic that is common in many real-world problems. To understand the extent to which structure plays a role, we developed fourteen new datasets in which the level of structure was gradually increased, despite the instances being generated at random. We showed that for instances with no structure (according to our proxy measure), algorithm-selection is unlikely to deliver much benefit, given that several heuristics give identical or very similar performance. On the other hand, our analysis revealed that the DL selector is able to exceed both the ML selectors and the SBS on random instances generated from a uniform-random distribution using a threshold of at least a 5% difference in Falkenauer performance between the BF and FF heuristics.

It is worth mentioning that our approach can also be applied to domains that do not naturally have sequence information, by artificially transforming them into sequences. For example, in the TSP domain, an instance is defined by a set of coordinates while a solution is a sequence (an ordered series of visits). In order to use our approach, a TSP map could be scanned in a fixed pattern, providing a sequence of coordinates representing the order in which the cities appear on the map and implicitly encapsulating spatial information. Sequences produced in this manner could then be used to train an LSTM or GRU.
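As a hypothetical illustration of such a fixed scanning pattern, the snippet below orders cities in raster fashion, band by band in y and left-to-right within each band; the band resolution and the assumption of coordinates normalised to [0, 1] are illustrative choices, not part of the study itself.

```python
def tsp_to_sequence(coords, bands=10):
    """Raster-scan a TSP map: group cities into horizontal bands by their
    y-coordinate, then read each band left-to-right by x-coordinate."""
    # coords: list of (x, y) pairs assumed normalised to [0, 1]
    return sorted(coords, key=lambda c: (int(c[1] * bands), c[0]))

# The resulting city sequence could be fed, one coordinate pair per step,
# to an LSTM/GRU selector in the same way as the item-size sequences.
```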

Future work will focus on extending the approach to larger and more complex sets of algorithms within the portfolio to be chosen from, such as meta-heuristics, and on applying the method to other domains of a sequential nature, such as flow-shop/job-shop scheduling. We also intend to investigate whether the method can be adapted to online problems where continuous streams of items are presented. By using a moving window that examines only the next n items to be packed (sketched below), our method may be able to adapt to a continuously changing environment. Another line of future work will be to consider hand-crafted features that account for the patterns observed in the sequential data, i.e. borrowing ideas from time-series analysis. Ultimately, the goal is to extract knowledge from the trained models in order to gain new insight into the correlation between orderings and predicted results.
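A rough sketch of the moving-window idea follows; the window length n and the point at which the selector would be re-queried are purely illustrative assumptions.

```python
from collections import deque

def stream_windows(item_stream, n=50):
    """Yield the window of the next n item-sizes once it is full, so that a
    trained selector could periodically re-choose the packing heuristic."""
    window = deque(maxlen=n)
    for size in item_stream:
        window.append(size)
        if len(window) == n:
            yield list(window)   # candidate input to the trained GRU selector
```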