Abstract
Metalearning, or learning to learn, is a machine learning approach that utilizes prior learning experiences to expedite the learning process on unseen tasks. As a data-driven approach, metalearning requires metafeatures that represent the primary learning tasks or datasets; traditionally, these are estimated as engineered dataset statistics that require expert domain knowledge tailored to every metatask. In this paper, first, we propose a metafeature extractor called Dataset2Vec that combines the versatility of engineered dataset metafeatures with the expressivity of metafeatures learned by deep neural networks. Primary learning tasks or datasets are represented as hierarchical sets, i.e., as a set of sets, specifically as a set of predictor/target pairs, and then a DeepSet architecture is employed to regress metafeatures on them. Second, we propose a novel auxiliary metalearning task with abundant data, called dataset similarity learning, that aims to predict whether two batches stem from the same dataset or from different ones. In an experiment on a large-scale hyperparameter optimization metatask covering 120 UCI datasets with varying schemas, we show that the metafeatures of Dataset2Vec outperform expert-engineered metafeatures and thus demonstrate, for the first time, the usefulness of learned metafeatures for datasets with varying schemas.
1 Introduction
Metalearning, or learning to learn, refers to any learning approach that systematically makes use of prior learning experiences to accelerate training on unseen tasks or datasets (Vanschoren 2018). For example, after having chosen hyperparameters for dozens of different learning tasks, one would like to learn how to choose them for the next task at hand. Hyperparameter optimization across different datasets is a typical metalearning task that has shown great success lately (Bardenet et al. 2013; Wistuba et al. 2018; Yogatama and Mann 2014). Domain adaptation and learning to optimize are other such metatasks of interest (Finn et al. 2017; Rusu et al. 2018; Finn et al. 2018).
As a data-driven approach, metalearning requires metafeatures that represent the primary learning tasks or datasets in order to transfer knowledge across them. Traditionally, simple, easy-to-compute, engineered metafeatures (Edwards and Storkey 2017a), such as the number of instances, the number of predictors, the number of targets (Bardenet et al. 2013), etc., have been employed. More recently, unsupervised methods based on variational autoencoders (Edwards and Storkey 2017b) have been successful in learning such metafeatures. However, both approaches suffer from complementary weaknesses. Engineered metafeatures often require expert domain knowledge and must be adjusted for each task, and hence have limited expressivity. On the other hand, metafeature extractors modeled as autoencoders can only compute metafeatures for datasets sharing the same schema, i.e. the same number, type, and semantics of predictors and targets.
Thus to be useful, metafeature extractors should meet the following four desiderata:
D1. Schema Agnosticism The metafeature extractor should be able to extract metafeatures for a population of metatasks with varying schemas, e.g., datasets containing different predictor and target variables and differing numbers of predictors and targets.
D2. Expressivity The metafeature extractor should be able to extract metafeatures for metatasks of varying complexity, i.e., just a handful of metafeatures for simple metatasks, but hundreds of metafeatures for more complex tasks.
D3. Scalability The metafeature extractor should be able to extract metafeatures fast, e.g., it should not itself require any training on new metatasks.
D4. Correlation The metafeatures extracted by the metafeature extractor should correlate well with the metatargets, i.e., improve the performance on metatasks such as hyperparameter optimization.
In this paper, we formalize the problem of metafeature learning as a step that can be shared between all kinds of metatasks, and ask for metafeature extractors that combine the versatility of engineered metafeatures with the expressivity of learned models such as neural networks, in order to transfer metaknowledge across (tabular) datasets with varying schemas (Sect. 3).
First, we design a novel metafeature extractor called Dataset2Vec that learns metafeatures from (tabular) datasets with a varying number of instances, predictors, or targets. Dataset2Vec represents primary learning tasks or datasets as hierarchical sets, i.e., as a set of sets, specifically as a set of predictor/target pairs, and then uses a DeepSet architecture (Zaheer et al. 2017) to regress metafeatures on them (Sect. 4).
As metatasks often comprise only a few hundred or thousand observations, it turns out to be difficult to learn an expressive metafeature extractor solely end-to-end on a single metatask at hand. We therefore, second, propose a novel metatask called dataset similarity learning that has abundant data and can be used as an auxiliary metatask to learn the metafeature extractor. The metatask consists of deciding whether two subsets of datasets, in which instances, predictors, and targets have been subsampled, so-called multi-fidelity subsets (Falkner et al. 2018), belong to the same dataset or not. Each subset is considered an approximation of the entire dataset whose degree of fidelity varies with the size of the subset. In other words, we assume a dataset is similar to a variant of itself with fewer instances, predictors, or targets (Sect. 5).
Finally, we experimentally demonstrate the usefulness of the metafeature extractor Dataset2Vec by the correlation of the extracted metafeatures with metatargets of interesting metatasks (D4). Here, we choose hyperparameter optimization as the metatask (Sect. 6).
A much simpler, unsupervised plausibility argument for the usefulness of the extracted metafeatures is depicted in Fig. 1, showing a 2D embedding of the metafeatures of 2000 synthetic classification toy datasets of three different types (circles/moons/blobs) computed by (a) two sets of engineered dataset metafeatures, MF1 (Wistuba et al. 2016) and MF2 (Feurer et al. 2015) (see Table 3); (b) a state-of-the-art model based on variational autoencoders, the Neural Statistician (Edwards and Storkey 2017b); and (c) the proposed metafeature extractor Dataset2Vec. For the 2D embedding, multidimensional scaling (Borg and Groenen 2003) has been applied to these metafeatures. As can be clearly seen, the metafeatures extracted by Dataset2Vec separate the three dataset types far better than the other two methods (see Sect. 6.3 for further details).
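Such a 2D projection is straightforward to reproduce for any fixed-length metafeature matrix; a minimal sketch using scikit-learn's MDS (the placeholder data and variable names are ours, not the paper's code):

```python
import numpy as np
from sklearn.manifold import MDS

# meta_features: one row of K metafeatures per dataset (random placeholders here)
meta_features = np.random.default_rng(0).normal(size=(200, 64))
embedding = MDS(n_components=2).fit_transform(meta_features)  # (200, 2) points to plot
```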
To sum up, in this paper we make the following key contributions:

1. We formulate a new problem setting: metafeature learning for datasets with varying schemas.
2. We design and investigate a metafeature extractor, Dataset2Vec, based on a representation of datasets as hierarchical sets of predictor/target pairs.
3. We design a novel metatask called dataset similarity learning that has abundant data and is therefore well-suited as an auxiliary metatask to train the metafeature extractor Dataset2Vec.
4. We show experimentally that using the metafeatures extracted by Dataset2Vec for the hyperparameter optimization metatask outperforms the use of engineered metafeatures specifically designed for this metatask.
2 Related work
In this section, we summarize the topics that relate to our work and highlight where the requirements mentioned earlier are (or are not) met.
Metafeature engineering Metafeatures represent measurable properties of tasks or datasets and play an essential role in metalearning. Engineered metafeatures can be simple statistics (Reif et al. 2014; Segrera et al. 2008) or even model-specific parameters with which a dataset is trained (Filchenkov and Pendryak 2015), and are generally applicable to any dataset, i.e. schema-agnostic (D1). In addition, the nature of these metafeatures makes them scalable (D3): they can be extracted without extra training. For example, the mean of the predictors can be estimated regardless of the number of targets. However, coupling these metafeatures with a metatask is a tedious process of trial-and-error that must be repeated for every metatask to find expressive (D2) metafeatures with good correlation (D4) to the metatarget.
Metafeature learning as a standalone task, i.e. agnostic to a predefined metatask, is, to the best of our knowledge, a new concept, with existing solutions bound to a fixed dataset schema. Autoencoder-based metafeature extractors such as the neural statistician (NS) (Edwards and Storkey 2017b) and its variant (Hewitt et al. 2018) propose an extension of the conventional variational autoencoder (Kingma and Welling 2014) in which the item to be encoded is the dataset itself. Nevertheless, these techniques require vast amounts of data (Edwards and Storkey 2017b) and are limited to datasets with a similar schema, i.e. they are not schema-agnostic (D1).
Embedding and metric learning approaches aim at learning semantic distance measures that position similar high-dimensional observations within proximity of each other on a manifold, i.e. the metafeature space. By transforming the data into embeddings, simple models can be trained to achieve significant performance (Snell et al. 2017; Berlemont et al. 2018). Learning these embeddings involves optimizing a distance metric (Song et al. 2016) and making sure that local feature similarities are observed (Zheng et al. 2018). This leads to more expressive (D2) metafeatures that allow for a better distinction between observations.
Metalearning is the process of learning new tasks by carrying over findings from previous tasks based on defined similarities between existing metadata. Metalearning has witnessed great success in domain adaptation, learning scalable internal representations of a model that adapt quickly to new tasks (Finn et al. 2017, 2018; Yoon et al. 2018). Existing approaches learn generic initial model parameters by sampling tasks from a task distribution with an associated train/validation dataset. Even within this line of research, we notice that learning metafeatures helps achieve state-of-the-art performance (Rusu et al. 2018), but such metafeatures do not generalize beyond the dataset schema (Achille et al. 2019; Koch et al. 2015). Potential improvements have, however, been shown with schema-agnostic model initialization (Brinkmeyer et al. 2019). Nevertheless, existing metalearning approaches result in task-dependent metafeatures, and hence the metafeatures only correlate (D4) with the respective metatask.
We notice that none of the existing approaches that involve metafeatures fulfills the complete list of desiderata. As a solution, we present a novel metafeature extractor, Dataset2Vec, that learns to extract expressive (D2) metafeatures directly from the dataset. Dataset2Vec, in contrast to existing work, is schema-agnostic (D1) and does not need to be adjusted for datasets with different schemas. We optimize Dataset2Vec with a novel dataset similarity learning approach that learns expressive (D2) metafeatures which maintain inter-dataset and intra-dataset distances depending on the degree of dataset similarity. Finally, we demonstrate the correlation (D4) between the metafeatures and unseen metatasks, namely hyperparameter optimization, as compared to engineered metafeatures.
3 Problem setting: metafeature learning
A (supervised) learning task is usually understood as the problem of finding a function (model) that maps given predictor values to likely target values, based on past observations of predictors and associated target values (a dataset). Many learning tasks depend on further inputs besides the dataset, e.g., hyperparameters like the depth and width of a neural network model, a regularization weight, a specific way to initialize the parameters of a model, etc. These additional inputs of a learning task are often found heuristically, e.g., hyperparameters by systematic grid search or random search (Bergstra and Bengio 2012), model parameter initializations by random normal samples, etc. From a machine learning perspective, finding these additional inputs of a learning task can itself be described as a learning task: its input is a whole dataset, its output a hyperparameter vector or an initial model parameter vector. To differentiate these two learning tasks, we call the first task, learning a model from a dataset, the primary learning task, and the second, finding good hyperparameters or a good model initialization, the metalearning task. Metalearning tasks are very special learning tasks, as their inputs are not simple vectors as in traditional classification and regression tasks, nor sequences or images as in time-series or image classification, but whole datasets.
To leverage standard vectorbased machine learning models for such metalearning tasks, their inputs, a whole dataset, must be described by a vector. Traditionally, this vector is engineered by experts and contains simple statistics such as the number of instances, the number of predictors, the number of targets, the mean and variance of the mean of the predictors, etc. These vectors that describe whole datasets and that are the inputs of the metatask are called metafeatures. The metafeatures together with the metatargets, i.e. good hyperparameter values or good model parameter initializations for a dataset, form the metadataset.
More formally, let \({\mathcal {D}}\) be the space of all possible datasets, i.e. data matrices containing a row for each instance and a column for each predictor and target, together with the number M of predictors (to mark which columns are predictors and which are targets). For simplicity of notation, for a dataset \(D\in {\mathcal {D}}\) we denote by \(N^{(D)}, M^{(D)}\) and \(T^{(D)}\) its number of instances, predictors and targets, and by \(X^{(D)}, Y^{(D)}\) its predictor and target matrices (see Table 1). A metatask is then a learning task that aims to find a metamodel \(\hat{{y}}^{\text {meta}}: {\mathcal {D}} \rightarrow {\mathbb R}^{T^{\text {meta}}}\), e.g., for hyperparameter learning of three hyperparameters (depth and width of a neural network and a regularization weight), to find good values for each given dataset (hence \(T^{\text {meta}}=3\)), or, for model parameter initialization of a neural network with 1 million parameters, to find good such initial values (here \(T^{\text {meta}}=1{,}000{,}000\)).
Most metamodels \(\hat{{y}}^{\text {meta}}\) are the composition of two functions:

(i) the metafeature extractor \(\hat{\phi }: {\mathcal {D}}\rightarrow \mathbb {R}^K\), which extracts from a dataset a fixed number K of metafeatures, and

(ii) a metafeature based metamodel \(\hat{{Y}}^{\text {meta}}: \mathbb {R}^K \rightarrow \mathbb {R}^{T^{\text {meta}}}\), which predicts the metatargets based on the metafeatures and can be a standard vector-based regression model chosen for the metatask at hand, e.g., a neural network.

Their composition yields the metamodel \(\hat{{y}}^{\text {meta}}\):

$$\hat{{y}}^{\text {meta}} := \hat{{Y}}^{\text {meta}} \circ \hat{\phi }$$
Let \(a^{\text {meta}}\) denote the learning algorithm for the metafeature based metamodel, e.g., stochastic gradient descent to learn a neural network that predicts good hyperparameter values based on dataset metafeatures.
The metafeature learning problem then is as follows: given (i) a metadataset \(({\mathcal {D}}^{\text {meta}}, {{\,\mathrm{\mathcal {Y}}\,}}^{\text {meta}})\) of pairs of (primary) datasets D and their metatargets \(y^{\text {meta}}\), (ii) a metaloss \(\ell ^{\text {meta}}: {{\,\mathrm{\mathcal {Y}}\,}}^{\text {meta}}\times {{\,\mathrm{\mathcal {Y}}\,}}^{\text {meta}}\rightarrow {\mathbb R}\) where \(\ell ^{\text {meta}}(y^{\text {meta}},\hat{{y}}^{\text {meta}})\) measures how bad the predicted metatarget \(\hat{{y}}^{\text {meta}}\) is for the true metatarget \(y^{\text {meta}}\), and (iii) a learning algorithm \(a^{\text {meta}}\) for a metafeature based metamodel (based on \(K\in {\mathbb N}\) metafeatures), find a metafeature extractor \(\hat{\phi }: {\mathcal {D}}\rightarrow {\mathbb R}^K\) such that the expected metaloss of the metamodel learned by \(a^{\text {meta}}\) from the metafeatures extracted by \(\hat{\phi }\) over new metainstances is minimal:

$$\hat{\phi } \in \mathop {\mathrm {arg\,min}}\limits _{\phi : {\mathcal {D}}\rightarrow {\mathbb R}^K} \; {\mathbb E}_{(D,\, y^{\text {meta}})} \; \ell ^{\text {meta}}\left( y^{\text {meta}},\; \hat{{y}}^{\text {meta}}(D)\right) $$

such that:

$$\hat{{y}}^{\text {meta}} = \hat{{Y}}^{\text {meta}}\circ \phi \quad \text {with} \quad \hat{{Y}}^{\text {meta}} = a^{\text {meta}}\left( \left\{ \left( \phi (D),\, y^{\text {meta}}\right) \,\big \vert \, (D, y^{\text {meta}}) \in ({\mathcal {D}}^{\text {meta}}, {{\,\mathrm{\mathcal {Y}}\,}}^{\text {meta}})\right\} \right) $$
Different from standard regression problems, where the loss is a simple distance between true and predicted targets such as the squared error, the metaloss is more complex, as its computation involves a primary model being learned and evaluated for the primary learning task. For hyperparameter optimization, if the best depth and width of a neural network for a specific task are 10 and 50, it is not very meaningful to measure the squared error distance to a predicted depth and width of, say, 5 and 20. Instead, one is interested in the difference of the primary test losses of primary models learned with these hyperparameters. So, more formally again, let \(\ell \) be the primary loss, say squared error for a regression problem, and a be a learning algorithm for the primary model; then

$$\ell ^{\text {meta}}\left( y^{\text {meta}}, \hat{{y}}^{\text {meta}}\right) := \ell \left( Y^{(D^{\text {test}})},\; \hat{{y}}\left( X^{(D^{\text {test}})}\right) \right) \quad \text {with} \quad \hat{{y}} := a\left( D^{\text {train}}, \hat{{y}}^{\text {meta}}\right) $$

i.e. the primary model \(\hat{{y}}\) is learned on a training split of the dataset using the predicted metatarget \(\hat{{y}}^{\text {meta}}\) (e.g., as hyperparameters) and evaluated on a disjoint test split.
4 The metafeature extractor Dataset2Vec
To define a learnable metafeature extractor \(\hat{\phi }: {\mathcal {D}}\rightarrow \mathbb {R}^K\), we will employ the Kolmogorov-Arnold representation theorem (Kuurkova 1991) and use the DeepSet architecture (Zaheer et al. 2017).
4.1 Preliminaries
The Kolmogorov-Arnold representation theorem (Kuurkova 1991) states that any multivariate function \(\phi \) of M variables \(X_1,\dots ,X_M\) can be represented as an aggregation of univariate functions, as follows:

$$\phi \left( X_1,\dots ,X_M\right) = \sum _{q=1}^{2M+1} h_q\left( \sum _{m=1}^{M} g_{q,m}\left( X_m\right) \right) $$
The ability to express a multivariate function \(\phi \) as an aggregation h of univariate functions g is a powerful representation when the function \(\phi \) needs to be invariant to the permutation order of the inputs X, i.e., \(\phi \left( X_1,\dots ,X_M\right) = \phi \left( X_{\pi (1)},\dots ,X_{\pi (M)}\right) \) for any index permutation \(\pi \) of \(\{1,\dots , M\}\). To illustrate the point with \(M=2\): we would like a function to satisfy \(\phi (X_1, X_2)=\phi \left( X_2,X_1\right) \), i.e., its output should not change if the inputs \(X_1, X_2\) are swapped. In a multilayer perceptron, by contrast, the output changes with the order of the inputs as soon as the input values differ; the same behavior is observed with recurrent and convolutional neural networks.
However, permutation-invariant representations are crucial if functions are defined on sets, for instance if we would like to train a neural network to output the sum of a set of digit images. Set-wise neural networks have recently gained prominence through the DeepSet formulation (Zaheer et al. 2017), where the \(2M+1\) functions \(h_q\) of the original Kolmogorov-Arnold representation are simplified to a single large-capacity neural network \(h: {\mathbb R}^K \rightarrow {\mathbb R}\), and the M univariate functions \(g_{q,m}\) are modeled with a single neural network \(g: {\mathbb R}\rightarrow {\mathbb R}^K\) shared across variables:

$$\phi \left( X_1,\dots ,X_M\right) = h\left( \sum _{m=1}^{M} g\left( X_m\right) \right) $$
Permutation-invariant functional representations are highly relevant for deriving metafeatures from tabular datasets, especially since we do not want the order in which the predictors of an instance are presented to affect the extracted metafeatures.
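To make the permutation invariance concrete, here is a minimal NumPy sketch of the DeepSet form above; fixed random weights stand in for the learned networks g and h (this is illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
W_g, b_g = rng.normal(size=(1, K)), rng.normal(size=K)   # parameters of g: R -> R^K
W_h, b_h = rng.normal(size=(K, 1)), rng.normal(size=1)   # parameters of h: R^K -> R

def g(x):                      # shared univariate embedding applied to each set element
    return np.maximum(0.0, x[:, None] * W_g + b_g)       # (M, K)

def h(z):                      # large-capacity function applied to the pooled embedding
    return np.maximum(0.0, z @ W_h + b_h)                # (1,)

def deepset(X):                # phi(X_1,...,X_M) = h(sum_m g(X_m))
    return h(g(X).sum(axis=0))

X = rng.normal(size=5)
assert np.allclose(deepset(X), deepset(rng.permutation(X)))  # permutation invariance
```

Because the sum pools over set elements before h is applied, any reordering of the inputs yields the same output.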
4.2 Hierarchical set modeling of datasets
In this paper, we design a novel metafeature extractor for tabular datasets, called Dataset2Vec, as a hierarchical set model. Tabular datasets are two-dimensional matrices of \((\#\text {rows} \times \#\text {columns})\), where the columns represent the predictor and target variables and the rows the instances. The order/permutation of the columns is trivially irrelevant to the semantics of a tabular dataset, and the rows are likewise permutation-invariant, as instances are assumed to be independent and identically distributed. From this perspective, a tabular dataset is a set of columns (predictor and target variables), where each column is a set of row values (instances):

- A dataset is a set of \(M^{\left( D\right) }+T^{\left( D\right) }\) predictor and target variables:
\(D = \left\{ X^{\left( D\right) }_1, \dots , X^{\left( D\right) }_{M^{\left( D\right) }},Y^{\left( D\right) }_1, \dots , Y^{\left( D\right) }_{T^{\left( D\right) }}\right\} \)
- where each variable is a set of \(N^{\left( D\right) }\) instances:
\(X^{\left( D\right) }_m = \left\{ X^{\left( D\right) }_{1,m}, \dots , X^{\left( D\right) }_{N^{\left( D\right) },m}\right\} \), \(m=1,\dots ,M^{\left( D\right) }\)
\(Y^{\left( D\right) }_t = \left\{ Y^{\left( D\right) }_{1,t}, \dots , Y^{\left( D\right) }_{N^{\left( D\right) },t}\right\} \), \(t=1,\dots ,T^{\left( D\right) }\)
In other words, a dataset is a set of sets. Based on this novel conceptualization we propose to model a dataset as a hierarchical set with two layers. More formally, let us restate that \(\left( X^{\left( D\right) },Y^{\left( D\right) }\right) = D\in {\mathcal {D}}\) is a dataset, where \(X^{\left( D\right) }\in {\mathbb R}^{N^{\left( D\right) }\times M^{\left( D\right) }}\) with \(M^{\left( D\right) }\) the number of predictors and \(N^{\left( D\right) }\) the number of instances, and \(Y^{\left( D\right) }\in {\mathbb R}^{N^{\left( D\right) }\times T^{\left( D\right) }}\) with \(T^{\left( D\right) }\) the number of targets. We model our metafeature extractor, Dataset2Vec, without loss of generality, as a feedforward neural network that accommodates all schemas of datasets. Formally, Dataset2Vec is defined in Equation 4:

$$\hat{\phi }(D) = h\left( \frac{1}{M^{\left( D\right) }\, T^{\left( D\right) }} \sum _{m=1}^{M^{\left( D\right) }} \sum _{t=1}^{T^{\left( D\right) }} g\left( \frac{1}{N^{\left( D\right) }} \sum _{n=1}^{N^{\left( D\right) }} f\left( X^{\left( D\right) }_{n,m}, Y^{\left( D\right) }_{n,t}\right) \right) \right) \qquad (4)$$
with \(f: {\mathbb R}^2\rightarrow {\mathbb R}^{K_f}\), \(g: {\mathbb R}^{K_f}\rightarrow {\mathbb R}^{K_g}\) and \(h: {\mathbb R}^{K_g}\rightarrow {\mathbb R}^{K}\) represented by neural networks with \(K_f\), \(K_g\) and K output units, respectively. Notice that reducing the input of f to a single predictor/target pair \(\left( X^{\left( D\right) }_{n,m}, Y^{\left( D\right) }_{n,t}\right) \) is what makes the metafeature extractor scalable, while still capturing the underlying correlation between each predictor and target variable. Each function models a different aspect of the dataset: instances, predictors, and targets. Function f captures the interdependency between an instance feature \(X^{\left( D\right) }_{n,m}\) and the corresponding instance target \(Y^{\left( D\right) }_{n,t}\), followed by a pooling layer across all instances \(n\in \{1,\dots ,N^{\left( D\right) }\}\). Function g extends the model across all targets \(t\in \{1,\dots ,T^{\left( D\right) }\}\) and predictors \(m\in \{1,\dots ,M^{\left( D\right) }\}\). Finally, function h applies a transformation to the latent representation averaged over predictors and targets, resulting in the metafeatures. Figure 2 depicts the architecture of Dataset2Vec.
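The following NumPy sketch mirrors Equation 4 with mean-pooling; the toy stand-ins for the three networks f, g, h (random affine maps with ReLU) and all variable names are ours for illustration only:

```python
import numpy as np

def dataset2vec(X, Y, f, g, h):
    """Sketch of Eq. 4: X is (N, M) predictors, Y is (N, T) targets;
    f, g, h are callables standing in for the three neural networks."""
    N, M = X.shape
    _, T = Y.shape
    pooled = []
    for m in range(M):
        for t in range(T):
            pairs = np.stack([X[:, m], Y[:, t]], axis=1)   # (N, 2) predictor/target pairs
            pooled.append(f(pairs).mean(axis=0))           # mean-pool over instances (f)
    z = g(np.stack(pooled)).mean(axis=0)                   # mean-pool over (m, t) pairs (g)
    return h(z)                                            # K metafeatures (h)

# toy stand-ins for the three networks (random affine + ReLU)
rng = np.random.default_rng(1)
Wf, Wg, Wh = rng.normal(size=(2, 16)), rng.normal(size=(16, 16)), rng.normal(size=(16, 4))
f = lambda a: np.maximum(0.0, a @ Wf)
g = lambda a: np.maximum(0.0, a @ Wg)
h = lambda a: a @ Wh

X, Y = rng.normal(size=(100, 3)), rng.normal(size=(100, 1))
phi = dataset2vec(X, Y, f, g, h)   # same output for any permutation of rows or columns
```

Since both pooling steps are means over sets, the output is invariant to row and column permutations and is defined for any N, M, and T.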
4.3 Network architecture
Our Dataset2Vec architecture is divided into three modules, \(\hat{\phi }{:}{=} h\circ g\circ f\), each implemented as a neural network. Let Dense(n) denote a fully connected layer with n neurons, and ResidualBlock(n, m) denote \(m\times \) Dense(n) with residual connections (Zagoruyko and Komodakis 2016). We present two architectures in Table 2: one for the toy metadataset, Sect. 6.1, and a deeper one for the tabular metadataset, Sect. 6.2. All layers use Rectified Linear Unit (ReLU) activations. Our reference implementation uses TensorFlow (Abadi et al. 2016).
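As an illustration of this layer vocabulary, here is a hedged tf.keras sketch of one such module; the layer sizes and depth are placeholders of ours, not the exact Table 2 configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, n, m):
    """m Dense(n) layers with a skip connection -- our reading of ResidualBlock(n, m)."""
    shortcut = x
    for _ in range(m):
        x = layers.Dense(n, activation="relu")(x)
    return layers.add([shortcut, x])

def make_f_module(k_f=64):
    """Sketch of one Dataset2Vec module, e.g. f, as Dense -> ResidualBlock -> Dense."""
    inp = layers.Input(shape=(2,))                  # one (predictor, target) pair
    x = layers.Dense(k_f, activation="relu")(inp)   # lift input to k_f units
    x = residual_block(x, k_f, 2)
    out = layers.Dense(k_f, activation="relu")(x)
    return tf.keras.Model(inp, out)

f = make_f_module()   # maps (batch, 2) pairs to (batch, 64) latent codes
```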
5 The auxiliary metatask: dataset similarity learning
Ideally, we would train the composition \(\hat{M}\circ \hat{\phi }\) of the metafeature extractor \(\hat{\phi }\) and a metafeature based metamodel \(\hat{M}\) end-to-end. But most metadatasets are small, rarely containing more than a few thousand or tens of thousands of metainstances, as each metainstance itself requires an independent primary learning process. Training a metafeature extractor directly end-to-end is thus prone to overfitting. Therefore, we propose to additionally employ an auxiliary metatask with abundant data to extract meaningful metafeatures.
5.1 The auxiliary problem
Dataset similarity learning is the following novel, yet simple metatask: given pairs of datasets with an assigned dataset similarity indicator \((x,x',i)\in {\mathcal {D}^{\text {meta}}}\times {\mathcal{{D}}^{\text {meta}}}\times \{0,1\}\) drawn from a joint distribution p over datasets, learn a dataset similarity model \(\hat{i}: {\mathcal {D}^{\text {meta}}}\times {\mathcal {D}^{\text {meta}}}\rightarrow \{0,1\}\) with minimal expected misclassification error

$${\mathbb E}_{(x,x',i)\sim p}\left[ I\left( \hat{i}(x,x') \ne i\right) \right] $$

where \(I(\text {true}){:}{=}1\) and \(I(\text {false}){:}{=}0\).
As mentioned previously, learning metafeatures from whole datasets \(D\in {\mathcal {D}^{\text {meta}}}\) directly is impractical and does not scale to larger datasets, especially since an explicit dataset similarity label is lacking. To overcome this limitation, we create implicitly similar datasets in the form of multi-fidelity subsets of the datasets, called batches: pairs of batches from the same dataset are assigned the similarity label \(i{:}{=}1\), whereas batches from different datasets are assigned the similarity label \(i{:}{=}0\). Hence, we define a multi-fidelity subset of a dataset D as the submatrix pair \((X',Y')\), Equation 6:

$$X' := X^{\left( D\right) }_{N',M'}, \qquad Y' := Y^{\left( D\right) }_{N',T'} \qquad (6)$$
with \(N'\subseteq \{1,\ldots ,N^{\left( D\right) }\}\), \(M'\subseteq \{1,\ldots ,M^{\left( D\right) }\}\), and \(T'\subseteq \{1,\ldots ,T^{\left( D\right) }\}\) representing the subset of indices of instances, features, and targets, respectively, sampled from the whole dataset.
The batch sampler, Algorithm 1, returns a batch with random index subsets \(N'\), \(M'\) and \(T'\), whose sizes are drawn uniformly from \(\{2^q \mid q \in \{4,\dots ,8\}\}\), \(\{1,\dots ,M\}\) and \(\{1,\dots ,T\}\), respectively, with indices sampled without replacement. Figure 3 is a pictorial representation of two randomly sampled batches from a tabular dataset.
Metafeatures of a dataset D can then be computed either directly over the dataset as a whole, \(\hat{\phi }(D)\), or estimated as the average over several random batches, Equation 7:

$$\hat{\phi }(D) \approx \frac{1}{B} \sum _{b=1}^{B} \hat{\phi }\left( X'_b, Y'_b\right) \qquad (7)$$

where B represents the number of random batches and \((X'_b, Y'_b)\) the b-th sampled batch.
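A NumPy sketch of this sampling-and-averaging pipeline follows; it encodes our reading of Algorithm 1 and Eq. 7, with `phi` standing in for a trained extractor:

```python
import numpy as np

def sample_batch(X, Y, rng):
    """Multi-fidelity subset of dataset (X, Y) -- our reading of Algorithm 1:
    |N'| is a power of two in {16,...,256}; predictor/target index subsets are
    drawn without replacement with sizes uniform in [1, M] and [1, T]."""
    N, M = X.shape
    T = Y.shape[1]
    rows = rng.choice(N, size=min(N, 2 ** rng.integers(4, 9)), replace=False)
    cols_x = rng.choice(M, size=rng.integers(1, M + 1), replace=False)
    cols_y = rng.choice(T, size=rng.integers(1, T + 1), replace=False)
    return X[np.ix_(rows, cols_x)], Y[np.ix_(rows, cols_y)]

def estimate_metafeatures(X, Y, phi, B=10, seed=0):
    """Eq. 7: average the extractor's output over B random batches."""
    rng = np.random.default_rng(seed)
    return np.mean([phi(*sample_batch(X, Y, rng)) for _ in range(B)], axis=0)
```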
Given a distribution \(p_{\mathcal {D}^{\text {meta}}}\) over datasets, we sample multi-fidelity subset pairs \((x, x', i)\), with i a binary value indicating whether both batches stem from the same dataset, for the dataset similarity learning problem using Algorithm 2.
5.2 The auxiliary metamodel and training
To force all information relevant for the dataset similarity learning task to be pooled in the metafeature space, we use a probabilistic dataset similarity model, Equation 8:

$$\hat{i}(x, x') := e^{-\gamma \left\Vert \hat{\phi }(x) - \hat{\phi }(x')\right\Vert _2} \qquad (8)$$

with \(\gamma \) as a tunable hyperparameter. We define the pairs of similar batches as \(P=\{(x,x',i) \sim p \mid i = 1\}\) and the pairs of dissimilar batches as \(W=\{(x,x',i) \sim p \mid i = 0\}\), and formulate the optimization objective as the negative log-likelihood:

$$\mathop {\mathrm {arg\,min}}\limits _{\hat{\phi }} \; -\frac{1}{|P|}\sum _{(x,x',i)\in P} \log \hat{i}(x,x') \; - \; \frac{1}{|W|}\sum _{(x,x',i)\in W} \log \left( 1-\hat{i}(x,x')\right) $$
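A NumPy sketch of this loss, assuming the exponentiated negative L2 distance form of Eq. 8 as reconstructed above (names are ours):

```python
import numpy as np

def similarity(phi_a, phi_b, gamma=1.0):
    """Eq. 8 as reconstructed: probability that two batches stem from the same dataset."""
    return np.exp(-gamma * np.linalg.norm(phi_a - phi_b, axis=-1))

def similarity_nll(phi_a, phi_b, same, gamma=1.0, eps=1e-12):
    """Negative log-likelihood over pairs; same[k] = 1 iff pair k stems from one dataset."""
    p = similarity(phi_a, phi_b, gamma)
    pos = -np.log(p[same == 1] + eps).mean()        # pull batches of the same dataset together
    neg = -np.log(1.0 - p[same == 0] + eps).mean()  # push batches of different datasets apart
    return pos + neg
```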
As in any metalearning task, we split the metadataset into \({\mathcal {D}^{\text {meta}}}_{train}\), \({\mathcal {D}^{\text {meta}}}_{valid}\) and \({\mathcal {D}^{\text {meta}}}_{test}\), which contain non-overlapping datasets for training, validation and testing, respectively. While training on the dataset similarity learning task, all the latent information is captured in the final layer of Dataset2Vec, our metafeature extractor, resulting in task-agnostic metafeatures. The pairwise loss between batches preserves intra-dataset similarity, i.e. close proximity of the metafeatures of similar batches, as well as inter-dataset dissimilarity, i.e. distant metafeatures of dissimilar batches.
Dataset2Vec is trained on a large number of batch samples from the datasets in a metadataset; it currently does not use any information about the subsequent metaproblem to be solved and is hence generic, in this sense an unsupervised metafeature extractor. Other metatasks could easily be integrated into Dataset2Vec by learning them jointly in a multi-task setting; in particular, a dataset reconstruction task similar to that of the NS (Edwards and Storkey 2017b) could be interesting, if one could figure out how to do this across different schemas. We leave this aspect for future work.
6 Experiments
We claim that for a metafeature extractor to produce useful metafeatures, it must meet the following requirements: schema-agnostic (D1), expressive (D2), scalable (D3), and correlating with metatasks (D4). We train and evaluate Dataset2Vec in support of these claims in the following two experiments, and highlight throughout the section where each criterion is met. Our implementation is available online.^{Footnote 1}
6.1 Dataset similarity learning for datasets of similar schema
Dataset similarity learning is designed as an auxiliary metatask that allows Dataset2Vec to capture in the metafeature space all the information required to distinguish between datasets. We learn to extract expressive metafeatures by minimizing the distance between metafeatures of batches from the same dataset and maximizing the distance between metafeatures of batches from different datasets. We stratify the sampling during training to avoid class imbalance. The reported results represent the average of a 5-fold cross-validation experiment, i.e. the reported metafeatures are extracted from the datasets in the meta-test set, illustrating the scalability (D3) of Dataset2Vec to unseen datasets.
6.1.1 Baselines
We compare with the neural statistician algorithm, NS (Edwards and Storkey 2017b), a metafeature extractor that learns metafeatures as context information by encoding complete datasets with an extended variational autoencoder. We focus on this technique particularly because it is trained in an unsupervised manner, with no metatask coupled to the extractor. However, since it is bound to a fixed dataset schema, we generate 2D labeled synthetic (toy) datasets to fairly train the spatial model presented by the authors.^{Footnote 2} We use the same hyperparameters as the authors. We also compare with two sets of well-established engineered metafeatures: MF1 (Wistuba et al. 2016) and MF2 (Feurer et al. 2015). A brief overview of the metafeatures provided by these sets is presented in Table 3. For detailed information about metafeatures, we refer the reader to Edwards and Storkey (2017a).
6.1.2 Evaluation metric
We evaluate the similarity between embeddings through pairwise classification accuracy with a cutoff threshold of \(\frac{1}{2}\). We set the hyperparameter \(\gamma \) in Equation 8 to 1, 0.1, and 0.1 for MF1, MF2, and NS, respectively, after tuning it on a separate validation set, and keep \(\gamma = 1\) for Dataset2Vec. We evaluate the pairwise classification accuracy over 16,000 pairs of batches containing an equal number of positive and negative pairs. The results are a summary of 5-fold cross-validation, during which the test datasets are not observed in the training process.
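The metric itself reduces to a thresholded comparison; a small sketch (a hypothetical helper of ours, reusing the reconstructed Eq. 8):

```python
import numpy as np

def pairwise_accuracy(phi_a, phi_b, same, gamma=1.0):
    """Accuracy of the similarity model at the 1/2 cutoff; same[k] is the true label."""
    p = np.exp(-gamma * np.linalg.norm(phi_a - phi_b, axis=-1))
    return float(((p > 0.5).astype(int) == same).mean())
```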
6.1.3 Toy meta dataset
We generate a collection of 10,000 2D datasets, each containing a varying number of samples. The datasets are created using the sklearn library (Pedregosa et al. 2011) and belong to either circles or moons with 2 classes (default), or blobs with a number of classes drawn uniformly at random within the bounds (2, 8). We also perturb the data with random noise. The toy metadataset is obtained by Algorithm 3. An example of the resulting datasets is depicted in Fig. 4.
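A sketch of such a generator follows; it is our reading of Algorithm 3, and the size and noise ranges are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs, make_circles, make_moons

def sample_toy_dataset(rng):
    """Draw one 2D toy dataset: circles, moons, or blobs with 2-8 classes."""
    n = int(rng.integers(100, 1000))                 # varying number of samples
    kind = rng.choice(["circles", "moons", "blobs"])
    if kind == "circles":
        X, y = make_circles(n_samples=n, noise=rng.uniform(0.0, 0.1))
    elif kind == "moons":
        X, y = make_moons(n_samples=n, noise=rng.uniform(0.0, 0.1))
    else:
        X, y = make_blobs(n_samples=n, centers=int(rng.integers(2, 9)))
    return X, y, kind

rng = np.random.default_rng(3)
datasets = [sample_toy_dataset(rng) for _ in range(10_000)]
```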
We randomly sample a fixed-size subset of 200 samples from every dataset for both approaches, adhering to the same conditions as in NS to ensure a fair comparison, and train both networks until convergence. We also extract the engineered metafeatures from these subsets. The pairwise classification accuracy is summarized in Table 4. We conducted a t-test to validate the distinction between the performance of Dataset2Vec and MF1, the second-best performing approach; the test is a standard procedure for comparing methods that are run once over multiple datasets (Demšar 2006). Dataset2Vec achieves a p-value of \(3.25\times 10^{-11}\), hence is significantly better than MF1 under a two-tailed hypothesis test with a significance level of 5% (standard test setting). Compared to NS, Dataset2Vec has \(45\times \) fewer parameters in addition to learning more expressive metafeatures.
The expressivity (D2) of Dataset2Vec is visible in Fig. 5, which depicts a 2D projection of the learned metafeatures: metafeatures of similar datasets are co-located in the space, while metafeatures of different datasets are distant.
Intuitively, it is also easy to understand why the circle and moon datasets might be more similar to each other than to the blobs, as seen in the projections in Fig. 5. This can largely be attributed to the large distance between instances of dissimilar classes in the blob datasets, whereas instances of dissimilar classes in the moon and circle datasets are closer together, Fig. 4.
6.2 Dataset similarity learning for datasets of different schema
For a given machine learning problem, say binary classification, one can hardly find a large enough collection of similar and dissimilar datasets with the same schema. The schema thus presents an obstacle that hinders the potential of learning useful metatasks. Dataset2Vec is schema-agnostic (D1) by design.
6.2.1 UCI meta dataset
The UCI repository (Dua and Graff 2017) contains a vast collection of datasets. We use 120 preprocessed classification datasets^{Footnote 3} with varying schemas to train the metafeature extractor by randomly sampling pairs of batches, Algorithm 2, varying the number of instances, predictors, and targets. Other sources of tabular datasets are available (Vanschoren et al. 2014); nevertheless, they suffer from quality issues (missing values, required preprocessing, etc.), which is why we focus on the preprocessed and normalized UCI classification datasets.
We achieve a pairwise classification accuracy of 88.20% ± 1.67 with a model of 45,424 parameters. In Table 5, we show five randomly selected groups of datasets that have been collected by a 5-nearest-neighbor method based on the Dataset2Vec metafeatures. We rely on the semantic similarity, i.e. the similarity of the names, of the UCI datasets to showcase neighboring datasets in the metafeature space, due to the lack of an explicit dataset similarity annotation for tabular datasets. For this metadataset, NS could not be applied due to the varying dataset schemas.
6.3 Hyperparameter optimization
Hyperparameter optimization plays an important role in the machine learning community and can be the deciding factor in whether a trained model performs at the state of the art or merely moderately. The use of metafeatures for this task has led to significant improvements, especially when used for warm-start initialization, the process of selecting initial hyperparameter configurations to fit a surrogate model based on the similarity between the target dataset and other available datasets, in Bayesian optimization techniques based on Gaussian processes (Wistuba et al. 2015; Lindauer and Hutter 2018) or on neural networks (Perrone et al. 2018). Surrogate transfer in sequential model-based optimization (Jones et al. 1998) is also improved by the use of metafeatures, as seen in the state of the art (Wistuba et al. 2018) and similar approaches (Wistuba et al. 2016; Feurer et al. 2018). Unfortunately, existing metafeatures, initially introduced to the hyperparameter optimization problem in Bardenet et al. (2013), are engineered based on intuition and tuned through trial-and-error. We improve the state of the art in warm-start initialization for hyperparameter optimization by replacing these engineered metafeatures with our learned, task-agnostic metafeatures, demonstrating further the capacity of the learned metafeatures to correlate with unseen metatasks (D4).
6.3.1 Baselines
The use of metafeatures has led to significant performance improvements in hyperparameter optimization. We follow the warm-start initialization technique presented by Feurer et al. (2015), where we select the top-performing configurations of the datasets most similar to the target dataset to initialize the surrogate model. By simply replacing the engineered metafeatures with the metafeatures from Dataset2Vec, we can directly evaluate the capacity of our learned metafeatures. Metafeatures employed for the initialization of hyperparameter optimization techniques include a battery of summaries calculated as measures from information theory (Castiello et al. 2005) and as general and statistical dataset properties (Reif et al. 2014), which require completely labeled data. A brief overview of some of the metafeatures used for warm-start initialization is given in Table 3. NS is not applicable in this scenario, considering that hyperparameter optimization is done across datasets with different schemas. We compare against the following surrogate models:

1. Random search (Bergstra and Bengio 2012): As the name suggests, random search simply selects random configurations at every trial, and has been shown to outperform conventional grid-search methods.
2. Tree Parzen Estimator (TPE) (Bergstra et al. 2011): A tree-based approach that constructs a density estimate over good and bad instantiations of each hyperparameter.
3. GP (Rasmussen 2003): The surrogate is modeled by a Gaussian process with a Matérn 3/2 kernel.
4. SMAC (Hutter et al. 2011): Instead of a Gaussian process, the surrogate is modeled as a random forest (Breiman 2001) that yields uncertainty estimates.^{Footnote 4}
5. Bohamiann (Springenberg et al. 2016): This approach relies on Bayesian neural networks to model the surrogate.
On their own, these baselines do not carry over information across datasets. However, by selecting the best-performing configurations of the most similar datasets, we can leverage the metaknowledge for a better initialization; a sketch of this selection step follows below.
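A minimal sketch of warm-start initialization in the spirit of Feurer et al. (2015); the inputs (per-dataset metafeature vectors and a table of best-known configurations) and all names are hypothetical:

```python
import numpy as np

def warm_start_configs(target_mf, dataset_mfs, best_configs, n_init=5):
    """Return the best-known configurations of the n_init datasets nearest
    to the target in metafeature space.
    target_mf: (K,); dataset_mfs: (D, K); best_configs: length-D list."""
    nearest = np.argsort(np.linalg.norm(dataset_mfs - target_mf, axis=1))[:n_init]
    return [best_configs[d] for d in nearest]  # evaluate these before fitting the surrogate
```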
6.3.2 Evaluation metrics
We follow the evaluation metric of Wistuba et al. (2016), the average distance to the minimum (ADTM). After t trials, ADTM is defined as

$$\text {ADTM}\left( \Lambda _t, {\mathcal {D}}\right) := \frac{1}{|{\mathcal {D}}|} \sum _{D \in {\mathcal {D}}} \min _{\lambda \in \Lambda _t^{D}} \frac{y(D, \lambda ) - y(D)^{\min }}{y(D)^{\max } - y(D)^{\min }}$$

with \(\Lambda _t^D\) as the set of hyperparameters that have been selected by a hyperparameter optimization method for dataset \(D \in \mathcal {D}\) in the first t trials, \(y(D,\lambda )\) the loss of configuration \(\lambda \) on D, and \(y(D)^{\min }\), \(y(D)^{\max }\) the range of the loss function on the hyperparameter grid \(\Lambda \) under investigation.
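A short sketch of this metric over a precomputed loss grid (the data layout is an assumption of ours):

```python
import numpy as np

def adtm(losses, selected):
    """ADTM sketch: losses is a (D, C) grid of primary losses per dataset and
    configuration; selected[d] lists the configurations tried on dataset d so far."""
    lo, hi = losses.min(axis=1), losses.max(axis=1)
    dist = [(losses[d, idx].min() - lo[d]) / (hi[d] - lo[d])
            for d, idx in enumerate(selected)]
    return float(np.mean(dist))
```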
6.3.3 UCI surrogate dataset
To expedite hyperparameter optimization experiments, it is essential to have a surrogate dataset in which different hyperparameter configurations have been evaluated beforehand. We create the surrogate dataset by training feedforward neural networks with different configurations on the 120 UCI classification datasets. For the neural network architecture, we define four layouts: the \(\square \) layout, where the number of hidden neurons is fixed across all layers; the \(\lhd \) layout, where the number of neurons in a layer is twice that of the previous layer; the \(\rhd \) layout, where the number of neurons is half that of the previous layer; and the \(\diamond \) layout, where the number of neurons per layer doubles until the middle layer and is then halved successively. We also use dropout (Srivastava et al. 2014) and batch normalization (Ioffe and Szegedy 2015) as regularization strategies, and stochastic gradient descent (SGD) (Bottou 2010), ADAM (Kingma and Ba 2015) and RMSProp (Tieleman and Hinton 2012) as optimizers. SeLU (Klambauer et al. 2017) denotes the self-normalizing activation unit. We present the complete grid of configurations in Table 6, which results in 3456 configurations per dataset.
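Such a grid can be enumerated mechanically; the following is a hypothetical encoding whose factor names and values only echo those mentioned in the text, not the exact contents of Table 6:

```python
from itertools import product

# hypothetical configuration grid (a subset of the factors named above)
grid = {
    "layout": ["square", "expanding", "contracting", "diamond"],
    "optimizer": ["sgd", "adam", "rmsprop"],
    "activation": ["relu", "selu"],
    "dropout": [0.0, 0.5],
    "batch_norm": [False, True],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
# each configuration is then evaluated once per dataset to populate the surrogate dataset
```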
6.3.4 Results and discussion
The results depicted in Fig. 6 are estimated using leave-one-dataset-out cross-validation over the 5 splits of the 120 datasets. We notice primarily the importance of metafeatures for warm-start initialization, as using metaknowledge outperforms all randomly initialized algorithms. Investigating the performance of the three initialization variants, we see that with SMAC and Bohamiann, our learned metafeatures consistently outperform the baselines with engineered metafeatures. With GP, the use of our metafeatures demonstrates an early advantage and a better final performance. These results demonstrate that our learned metafeatures, originally uncoupled from the metatask of hyperparameter optimization, prove useful and competitive, i.e. correlate (D4) with the metatask. It is also worth mentioning that the learned metafeatures do not require access to the whole labeled dataset, making them more broadly applicable, Eq. 7.
7 Conclusion
We present a novel hierarchical set model for metafeature learning based on the Kolmogorov-Arnold representation theorem, named Dataset2Vec. We parameterize the model as a feedforward neural network that accommodates tabular datasets of varying schemas. To learn these metafeatures, we design a novel dataset similarity learning task that enforces the proximity of metafeatures extracted from similar datasets and increases the distance between metafeatures extracted from dissimilar datasets. The learned metafeatures can easily be used with unseen metatasks, e.g. the same metafeatures can be used for hyperparameter optimization and few-shot learning, which we leave for future work. It seems likely that metafeatures learned jointly with the metatask at hand will turn out to focus on the characteristics relevant for that particular metatask and thus provide even better metalosses, a direction of further research worth investigating.
References
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga R, Moore S, Murray DG, Steiner B, Tucker PA, Vasudevan V, Warden P, Wicke M, Yu Y, Zheng X (2016) Tensorflow: a system for largescale machine learning. OSDI, USENIX Association, Berkeley, pp 265–283
Achille A et al. (2019) Task2vec: Task embedding for metalearning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019
Bardenet R, Brendel M, Kégl B, Sebag M (2013) Collaborative hyperparameter tuning. In: International conference on machine learning, pp 199–207
Bergstra J, Bengio Y (2012) Random search for hyperparameter optimization. J Mach Learn Res 13:281–305
Bergstra JS, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyperparameter optimization. In: Advances in neural information processing systems, pp 2546–2554
Berlemont S, Lefebvre G, Duffner S, Garcia C (2018) Classbalanced siamese neural networks. Neurocomputing 273:47–56
Borg I, Groenen P (2003) Modern multidimensional scaling: theory and applications. J Educ Meas 40(3):277–280
Bottou L (2010) Largescale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer, pp 177–186
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Brinkmeyer L, Drumond RR, Scholz R, Grabocka J, Schmidt-Thieme L (2019) Chameleon: learning model initializations across tasks with different schemas. arXiv preprint arXiv:1909.13576
Castiello C, Castellano G, Fanelli AM (2005) Metadata: characterization of input features for metalearning. In: MDAI, Springer, Lecture notes in computer science, vol 3558, pp 457–468
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Edwards H, Storkey AJ (2017a) Towards a neural statistician https://openreview.net/forum?id=HJDBUF5le
Edwards H, Storkey AJ (2017b) Towards a neural statistician. In: ICLR, OpenReview.net
Falkner S, Klein A, Hutter F (2018) BOHB: robust and efficient hyperparameter optimization at scale. In: ICML, Proceedings of machine learning research, vol 80, pp 1436–1445. http://proceedings.mlr.press/v80/falkner18a.html
Feurer M, Springenberg JT, Hutter F (2015) Initializing bayesian hyperparameter optimization via metalearning. In: Bonet B, Koenig S (eds) Proceedings of the twentyninth AAAI conference on artificial intelligence, January 25–30, 2015, Austin, Texas, USA, AAAI Press, pp 1128–1135, http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/10029
Feurer M, Letham B, Bakshy E (2018) Scalable metalearning for bayesian optimization. CoRR arXiv:1802.02219
Filchenkov A, Pendryak A (2015) Datasets metafeature description for recommending feature selection algorithm. In: 2015 Artificial intelligence and natural language and information extraction, social media and web search FRUCT conference (AINLISMW FRUCT), IEEE, pp 11–18
Finn C, Abbeel P, Levine S (2017) Modelagnostic metalearning for fast adaptation of deep networks. In: Proceedings of the 34th international conference on machine learningvolume 70, JMLR.org, pp 1126–1135
Finn C, Xu K, Levine S (2018) Probabilistic modelagnostic metalearning. In: NeurIPS, pp 9537–9548
Hewitt LB, Nye MI, Gane A, Jaakkola TS, Tenenbaum JB (2018) The variational homoencoder: learning to learn high capacity generative models from few examples. In: UAI. AUAI Press, pp 988–997
Hutter F, Hoos HH, LeytonBrown K (2011) Sequential modelbased optimization for general algorithm configuration. In: International conference on learning and intelligent optimization. Springer, pp 507–523
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML, JMLR.org, JMLR workshop and conference proceedings, vol 37, pp 448–456
Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive blackbox functions. J Global Optim 13(4):455–492
Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd International conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, conference track proceedings, arXiv:1412.6980
Kingma DP, Welling M (2014) Autoencoding variational bayes. In: Bengio Y, LeCun Y (eds) 2nd International conference on learning representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, conference track proceedings. arXiv:1312.6114
Klambauer G, Unterthiner T, Mayr A, Hochreiter S (2017) Selfnormalizing neural networks. In: Advances in neural information processing systems, pp 971–980
Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for oneshot image recognition. In: ICML deep learning workshop, vol 2
Kuurkova V (1991) Kolmogorov’s theorem is relevant. Neural Comput 3(4):617–622
Lindauer M, Hutter F (2018) Warmstarting of modelbased algorithm configuration. In: AAAI. AAAI Press, pp 1355–1362
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikitlearn: machine learning in python. J Mach Learn Res 12:2825–2830
Perrone V, Jenatton R, Seeger MW, Archambeau C (2018) Scalable hyperparameter transfer learning. In: NeurIPS, pp 6846–6856
Rasmussen CE (2003) Gaussian processes in machine learning. In: Summer school on machine learning. Springer, pp 63–71
Reif M, Shafait F, Goldstein M, Breuel TM, Dengel A (2014) Automatic classifier selection for nonexperts. Pattern Anal Appl 17(1):83–96
Rusu AA, Rao D, Sygnowski J, Vinyals O, Pascanu R, Osindero S, Hadsell R (2018) Metalearning with latent embedding optimization. CoRR abs/1807.05960
Segrera S, Lucas JP, García MNM (2008) Informationtheoretic measures for metalearning. In: HAIS, Springer, Lecture notes in computer science, vol 5271, pp 458–465
Snell J, Swersky K, Zemel R (2017) Prototypical networks for fewshot learning. In: Advances in neural information processing systems, pp 4077–4087
Song HO, Xiang Y, Jegelka S, Savarese S (2016) Deep metric learning via lifted structured feature embedding. In: CVPR, IEEE computer society, pp 4004–4012
Springenberg JT, Klein A, Falkner S, Hutter F (2016) Bayesian optimization with robust bayesian neural networks. In: Advances in neural information processing systems, pp 4134–4142
Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Tieleman T, Hinton G (2012) Lecture 6.5rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn 4(2):26–31
Vanschoren J (2018) Metalearning: a survey. arXiv preprint arXiv:1810.03548
Vanschoren J, Van Rijn JN, Bischl B, Torgo L (2014) Openml: networked science in machine learning. ACM SIGKDD Explor Newslett 15(2):49–60
Wistuba M, Schilling N, Schmidt-Thieme L (2015) Sequential modelfree hyperparameter tuning. In: ICDM, IEEE computer society, pp 1033–1038
Wistuba M, Schilling N, Schmidt-Thieme L (2016) Twostage transfer surrogate model for automatic hyperparameter optimization. In: ECML/PKDD, Springer, Lecture notes in computer science, vol 9851, pp 199–214
Wistuba M, Schilling N, Schmidt-Thieme L (2018) Scalable gaussian processbased transfer surrogates for hyperparameter optimization. Mach Learn 107(1):43–78
Yogatama D, Mann G (2014) Efficient transfer learning method for automatic hyperparameter tuning. In: AISTATS, JMLR.org, JMLR workshop and conference proceedings, vol 33, pp 1077–1085
Yoon J, Kim T, Dia O, Kim S, Bengio Y, Ahn S (2018) Bayesian modelagnostic metalearning. In: NeurIPS, pp 7343–7353
Zagoruyko S, Komodakis N (2016) Wide residual networks. In: Wilson RC, Hancock ER, Smith WAP (eds) Proceedings of the British machine vision conference 2016, BMVC 2016, York, UK, September 19–22, 2016. BMVA Press. http://www.bmva.org/bmvc/2016/papers/paper087/index.html
Zaheer M, Kottur S, Ravanbakhsh S, Póczos B, Salakhutdinov RR, Smola AJ (2017) Deep sets. In: NIPS, pp 3394–3404
Zheng Z, Zheng L, Yang Y (2018) A discriminatively learned CNN embedding for person reidentification. TOMCCAP 14(1):1–20
Acknowledgements
This work is cofunded by the industry Project “IIPEcosphere: Next Level Ecosphere for Intelligent Industrial Production”. Prof. Grabocka is also thankful to the Eva MayrStihl Foundation for their generous research Grant.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Additional information
Responsible editor: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Jomaa, H.S., Schmidt-Thieme, L. & Grabocka, J. Dataset2Vec: learning dataset metafeatures. Data Min Knowl Disc 35, 964–985 (2021). https://doi.org/10.1007/s10618-021-00737-9