1 Introduction

The evolution of Automated Machine Learning (AutoML) has been pivotal in addressing the complexity of developing specialized, end-to-end machine learning pipelines. However, existing AutoML frameworks often face a critical challenge: efficiently and accurately handling multimodal data (Liu et al., 2021; Wistuba et al., 2019). The lack of generalised frameworks for multi-modality processing (Baevski et al., 2022) and the dearth of systematic comparisons between information fusion techniques (Liang et al., 2021) remain the main hurdles for multimodal AutoML. Multimodal Neural Architecture Search (NAS) is notably resource-intensive, creating a barrier to efficient pipeline development (Elsken et al., 2018; Liu et al., 2018). This paper addresses this challenge by proposing a novel approach that leverages the power of pre-trained Transformer models, renowned for their effectiveness in domains such as Natural Language Processing and Computer Vision (Du et al., 2022; Öztürk et al., 2022; Qiu et al., 2020).

Our approach aims to minimize the reliance on NAS for processing multimodal data. By integrating pre-trained Transformer models, we efficiently bridge the gap between different data modalities, transferring knowledge and reducing the computational overhead commonly associated with NAS. Furthermore, we enhance this integration with a warm-started Bayesian Optimization technique, an intelligent mechanism that guides the search for optimal pipeline configurations. It leverages historical data and prior experience, akin to the adaptability observed in human cognitive processes (Van Ackeren et al., 2018). By initiating the search from a promising region within the configuration space, we substantially lower computational demands, providing a practical and effective solution for multimodal AutoML tasks.

This research presents a comprehensive solution to the dual challenges of multimodal data processing in AutoML: reducing the dependency on expensive NAS methods (Wistuba et al., 2019) and effectively integrating pre-trained Transformer models (Liu et al., 2021; Zöller & Huber, 2019). Our approach not only streamlines the development of multimodal ML pipelines but also ensures their adaptability and efficiency, representing an advancement in AutoML for handling complex data modalities such as tabular-text, text-vision, and vision-text-tabular configurations. The major contributions of this paper include the design of a versatile search space (pipeline) for multimodal data, the strategic incorporation of pre-trained models within the pipeline architectures, and the implementation of warm-starting for SMAC using metadata derived from prior evaluations. This novel methodology underscores our commitment to enhancing AutoML’s capability to navigate and optimize multimodal data processing efficiently.

2 Related works

2.1 Automated machine learning (AutoML)

Automated Machine Learning has become a crucial component in data-driven decision-making, enabling domain experts to utilize machine learning without needing extensive statistical expertise. AutoML primarily revolves around the Combined Algorithm Selection and Hyperparameter Optimization (CASH) concept, focusing on scalability and computational efficiency (Zöller & Huber, 2019). While frameworks like the Tree-Based Pipeline Optimization Tool (TPOT) have showcased the potential for sophisticated pipelines in AutoML (Olson & Moore, 2016), challenges persist, especially in Neural Architecture Search (NAS) and hyperparameter optimization.

NAS, essential for customizing architectures, struggles with computational demands and optimization complexities (Wistuba et al., 2019). Meanwhile, advancements in Bayesian optimization have enhanced hyperparameter space navigation (Zöller & Huber, 2019). However, an integrated framework harnessing both NAS and hyperparameter tuning remains elusive. This research responds to this need by proposing a unified AutoML framework that synergizes NAS with hyperparameter optimization for efficient pipeline creation and tuning. A key focus of our approach is the concept of warm-starting, drawing parallels with human cognitive processes to efficiently adapt to new tasks through meta-learning (Barrett, 2017; Vanschoren, 2020). This involves using meta-features and prior evaluations to guide the configuration search, employing a meta-learner to predict performance and expedite the convergence to optimal solutions (Hospedales et al., 2020; Nguyen et al., 2014). Such an approach not only speeds up the optimization process but also imbues AutoML systems with cognitive-like adaptability, enhancing their capability to handle diverse and novel tasks.

2.2 AutoML and pre-trained transformer models

In the AutoML domain, CLIP’s integration in AutoGluon marks a pivotal shift towards multimodal learning (Erickson et al., 2020; Radford et al., 2021). Despite its innovative approach to image-text pairings, CLIP’s limitations in handling text-only or image-only data highlight a need for more versatile models (Radford et al., 2021). Transformers, known for their success in NLP tasks (Devlin et al., 2018; Lan et al., 2019; Liu et al., 2019), and Vision Transformers (ViT) for vision tasks (Dosovitskiy et al., 2020), offer a promising solution with their efficient handling of long input sequences.

Pre-trained vision-language transformer models in AutoML can significantly enhance automatic pipeline synthesis by adeptly handling multimodal data (Du et al., 2022). Recent developments in transformer architectures have produced single- and dual-stream models, each with unique strengths in processing multimodal inputs. Single-stream models like OSCAR, VisBERT, and VLBERT offer a unified approach but face challenges with intra-modal interactions (Li et al., 2020, 2019; Su et al., 2019), while dual-stream models like LXMERT and Albef excel in cross-modal attention (Li et al., 2021; Tan & Bansal, 2019). Our research explores the impact of various transformer model architectures and pre-training objectives on AutoML systems. We focus on models such as FLAVA, a dual-stream architecture generating cross-modal embeddings (Singh et al., 2021), Albef, which aligns modalities before feeding them to a multimodal Transformer (Li et al., 2021), and Data2Vec, a modality-agnostic model (Baevski et al., 2022).

2.3 Configuration space \(\Theta\)

The construction of a structured configuration space remains central to our inquiry into integrating pre-trained Transformer models within AutoML systems. The configuration space of an AutoML system is an integral part of generating automated pipelines: it is the space that a search algorithm explores to find the specific elements of a machine learning pipeline (Hutter et al., 2019). This space is structured and parameterized to confine the search (Hutter et al., 2019; Öztürk et al., 2022). Current AutoML systems like AutoWEKA (Thornton et al., 2012), AutoGluon (Erickson et al., 2020) and AutoSklearn (Feurer et al., 2020) construct these spaces hierarchically to enable a guided search strategy. Given n hyperparameters (continuous or categorical) \(\lambda _{1}, \lambda _{2} \cdots \lambda _{n}\) with domains \(\Lambda _{1}, \Lambda _{2} \cdots \Lambda _{n}\), the configuration space \(\Theta\) is a subset of the cross-product of these domains: \(\Theta \subset \Lambda _{1} \times \cdots \times \Lambda _{n} \cup \lambda _{r}\), where \(\lambda _{r}\) is a root-level hyperparameter. This subset is strict, such as when certain settings of one hyperparameter render other hyperparameters inactive, inducing a hierarchical structure within the configuration space (Thornton et al., 2012). This hierarchical structure is critical because it prevents the sampling of incompatible hyperparameters. Incompatibility, or negative transfer, can occur in two ways: the sampled hyperparameters may not align with the sampled pre-trained model, or the sampled pre-trained model (itself a hyperparameter) may not align with the task (\(\lambda _{r}\)). More formally, following Thornton et al. (2012), a hyperparameter \(\lambda _{i}\) is conditional on another hyperparameter \(\lambda _{j}\) if \(\lambda _{i}\) is only active when \(\lambda _{j}\) takes values from a given set \(\mathcal {V}_{i}(j) \subsetneq \Lambda _{j}\); in this case, we call \(\lambda _{j}\) a parent of \(\lambda _{i}\). Conditional hyperparameters can in turn be parents of other conditional hyperparameters, giving rise to a tree-structured space (Thornton et al., 2012).
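To make this conditionality concrete, the following minimal sketch builds a small tree-structured space with the ConfigSpace library (the package SMAC builds on). The hyperparameter names, value ranges, and the pre-1.0 `add_hyperparameters`/`add_condition` API used here are illustrative assumptions, not the paper's exact space from Table 2.

```python
# Minimal sketch of a tree-structured configuration space using the
# ConfigSpace library (the package SMAC builds on). Hyperparameter names
# and ranges are illustrative, not the paper's exact space.
from ConfigSpace import ConfigurationSpace
from ConfigSpace.conditions import EqualsCondition
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter,
    UniformFloatHyperparameter,
)

cs = ConfigurationSpace(seed=0)

# Root-level hyperparameter lambda_r: the task determines the active subtree.
task = CategoricalHyperparameter("task", ["tabular_text_clf", "vqa"])

# The pre-trained model is itself a hyperparameter; only vision-language
# models are valid for VQA, so this choice is a child of the root.
vqa_model = CategoricalHyperparameter("vqa_model", ["flava", "albef"])

# A numerical hyperparameter that exists only for FLAVA-based pipelines,
# making vqa_model in turn the parent of another conditional hyperparameter.
flava_dropout = UniformFloatHyperparameter(
    "flava_dropout", lower=0.0, upper=0.5, default_value=0.1
)

cs.add_hyperparameters([task, vqa_model, flava_dropout])
cs.add_condition(EqualsCondition(vqa_model, task, "vqa"))            # V(task) = {vqa}
cs.add_condition(EqualsCondition(flava_dropout, vqa_model, "flava"))

# Sampling respects the hierarchy: inactive hyperparameters are never drawn,
# which is what prevents incompatible (negative-transfer) pipelines.
print(cs.sample_configuration())
```

Because inactive hyperparameters are never sampled, the search algorithm only ever sees configurations that are valid for the chosen value of \(\lambda _{r}\).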

Any search strategy selected to explore the structured hierarchical configuration space should handle the exploration-exploitation trade-off, i.e., finding well-performing algorithms while avoiding premature convergence to a region of sub-optimal algorithms (Elsken et al., 2018). According to Vanschoren (Hospedales et al., 2020), a configuration space \(\Theta ^*\) (continuous, categorical or mixed) consisting of hyperparameter settings, pipeline components and/or network architecture components can be learned via meta-learning from evaluations of algorithms on a set of prior evaluations P.

2.4 Multimodal learning and fusion techniques

In light of the developments in Automated Machine Learning (AutoML), particularly in Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO), this section focuses on the fusion strategy for integrating multimodal information. The motivation stems from the challenges in multimodal representation learning, as detailed by Liang et al. through MULTIBENCH, a comprehensive benchmark for multimodal learning spanning diverse datasets and modalities (Liang et al., 2021). Among various fusion paradigms like Early Fusion, Multiplicative Interactions, and Temporal Attention Models, Late Fusion (LF) emerges as notably effective. LF demonstrates a favorable balance between performance and robustness, outperforming more complex methods like MFAS or MuLT (Liang et al., 2021).

Furthermore, LF’s adaptability across various domains is highlighted by Shi et al. (2021), who demonstrate its efficacy in synthesizing end-to-end ML pipelines for combined tabular and text modalities (Erickson et al., 2022). They also show LF’s high accuracy in different AutoML strategies. Based on this empirical evidence and the need for robust, adaptable fusion methods in multimodal learning, Late Fusion is selected as the fusion strategy in our pipeline architecture. This choice aligns with the current research trends and promises enhanced performance and adaptability in handling multimodal datasets within our AutoML framework.

3 Problem formulation

In this section, we formally describe the problem studied in the scope of this work. To aid the formalism, a list of all mathematical notations can be found in Table 1.

Table 1 Mathematical Notations and Their Representations

3.1 Combined algorithm selection and hyperparameter optimization (CASH)

Our objective for incorporating pre-trained (Transformer) models in AutoML systems is to enable AutoML over unimodal as well as multimodal data. For that, we formulate a Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem. CASH entails selecting optimal learning algorithms from a set \(\mathcal {A}\), which includes pre-trained deep models (\(A_{ptm}^{(i)}\)) and classical machine learning models (\(A_{m}^{(i)}\)), and fine-tuning their hyperparameters (\(\lambda\)) for peak performance.

Formally:

  • Algorithm set \(\mathcal {A}\): Comprises \(m\) pre-trained models (\(A_{ptm}^{(1)}, \ldots , A_{ptm}^{(m)}\)) and \(n\) classical ML models (\(A_{m}^{(1)}, \ldots , A_{m}^{(n)}\)).

  • Hyperparameter space \(\Lambda\): A subset of the Cartesian product of individual hyperparameter domains, \(\Lambda \subset \prod _{i=1}^{m} \Lambda _{ptm}^{(i)} \times \prod _{j=1}^{n} \Lambda _{m}^{(j)}\).

  • Pipeline structures set \(G\): All valid combinations of one pre-trained model and one classical ML model. The abstraction of a valid pipeline structure \(g'\) can be seen in Fig. 1.

Fig. 1 Overview of the pipeline structure \(g'\) whose components need to be generated and whose hyperparameters need to be optimised in a combined fashion. \(A_{ptm}\) is the selected pre-trained model, which generates the embedding \(\hat{e}\); this embedding is fed to a classical ML model \(A_{m}\) that maps \(\hat{e}\) to the target domain \(\mathbb {Y}\)

Embeddings (\(\hat{e}\)) from the pre-trained model lie in a high-dimensional space \(\mathbb {R}^{E}\) and are inputs to classical ML models for mapping to the target domain \(\mathbb {Y}\). The goal is to identify the optimal pipeline configuration \(g'\), which comprises a pre-trained model (\(A_{ptm}\)) and a classical ML model (\(A_{m}\)), each with its tuned hyperparameters (\(\lambda _{i}^{*}\) and \(\lambda _{j}^{*}\), \(i \ne j\)). The true performance of a pipeline configuration is given by:

$$\begin{aligned} \hat{R} \left( \mathcal {P}_{g',\hat{A^{*}},\hat{\lambda ^{*}},P}, P \right) = {\mathbb{E}}[\mathcal {L}(h(\mathbb {X}), \mathbb {Y})] = \int \mathcal {L}(h(\mathbb {X}),\mathbb {Y}) \, dP(\mathbb {X,Y}) \end{aligned}$$
(1)

Since the true distribution \(P(X,Y)\) is unknown, we approximate the performance with a dataset \(D\):

$$\begin{aligned} \hat{R} \left( \mathcal {P}_{g',\hat{A^{*}},\hat{\lambda ^{*}}, D}, D \right) = \frac{1}{m} \sum _{i=1}^{m} \mathcal {L}(h(x_{i}), y_{i}) \end{aligned}$$
(2)

where \(\mathcal {L}\) is the loss function and \(h(\cdot)\) is the hypothesis used to approximate the true process. The objective is to minimize this estimated performance over \(k\)-fold cross-validation:

$$\begin{aligned} \left( g',\hat{A^{*}}, \hat{\lambda ^{*}} \right) ^{*} = {\mathop {\mathrm{arg\,min}}\limits _{\hat{A^{*}} \in \mathcal {A}^{|g'|}, g' \in G, \hat{\lambda ^{*}} \in \Lambda }}\, \frac{1}{k} \sum _{i=1}^{k} \hat{R} \left( \mathcal {P}_{g',\hat{A^{*}},\hat{\lambda ^{*}},D^{(i)}_{train}}, D^{(i)}_{valid} \right) \end{aligned}$$
(3)

Optimization occurs across a hierarchical hyperparameter space, including a root-level hyperparameter (\(\lambda _{r}\)) for algorithm and pipeline structure selection. It is important to note that our approach does not assume that all pre-trained models will align with the specific learning tasks. Instead, we manage this issue through a structured hierarchical configuration space where the selection of a pre-trained model is analogous to choosing a hyperparameter. This configuration space is designed to activate certain pre-trained models only if the task-specific root-level hyperparameter \(\lambda _{r}\) permits their inclusion. This method ensures that only relevant and task-appropriate models are considered, thereby minimizing the risk of negative transfer.
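As a minimal illustration of the inner evaluation in Eq. (3), the sketch below scores one candidate pipeline \((A_{ptm}, A_{m})\) by k-fold cross-validation. Everything here is synthetic: the `embed_with_ptm` helper is a hypothetical stand-in for a frozen pre-trained model producing embeddings \(\hat{e} \in \mathbb {R}^{E}\), implemented as a random projection so the snippet runs self-contained.

```python
# Sketch of the inner evaluation in Eq. (3): a pipeline g' = (A_ptm, A_m)
# is scored by k-fold cross-validation. `embed_with_ptm` stands in for a
# frozen pre-trained model; in the paper this would be FLAVA / Albef /
# Data2Vec embeddings, not a random projection.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def embed_with_ptm(X_raw: np.ndarray, dim: int = 64) -> np.ndarray:
    """Placeholder for a frozen A_ptm mapping raw inputs to R^E."""
    W = rng.standard_normal((X_raw.shape[1], dim))
    return np.tanh(X_raw @ W)

# Toy "raw" data standing in for a multimodal dataset D.
X_raw = rng.standard_normal((200, 32))
y = rng.integers(0, 2, size=200)

e_hat = embed_with_ptm(X_raw)                   # A_ptm: raw inputs -> R^E
clf = LogisticRegression(max_iter=1000)         # A_m with hyperparameters lambda
scores = cross_val_score(clf, e_hat, y, cv=5)   # (1/k) * sum_i R_hat(...)
print("estimated risk (1 - mean CV accuracy):", 1.0 - scores.mean())
```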

3.2 The problem of warm-starting

In the context of AutoML's CASH problem, warm-starting is the process of initiating the optimization with informed configurations derived from prior knowledge, as opposed to the random initial configurations of cold-starting. We define warm-starting as leveraging historical data or results from related tasks to jump-start the current optimization task, aiming to reduce the time and computational resources required to reach an optimal solution. This definition rests on several assumptions. First, sufficient meta-data is available for initial configuration selection. Second, warm-starting presupposes a balance between exploiting known effective configurations and exploring new regions of a vast, complex configuration space, so as to avoid local optima. Finally, it operates on the premise that beginning the optimization process near potentially optimal solutions incrementally enhances efficiency and effectiveness, favoring steady progress over random breakthroughs.

To facilitate the warm-starting process, we introduce a meta-dataset M, illustrated in Fig. 2, formalized as follows:

  • The meta-dataset M encompasses scalar performance metrics \(P_{j,i}\), each derived from evaluating the ith configuration of an ML pipeline on a task \(t_{j}\) from a set of tasks T. Each configuration integrates a pre-trained model \(A_{ptm}^{(m)}\) and a traditional ML model \(A_{m}^{(n)}\), alongside their aggregate hyperparameters \(\lambda _{j,i} \in \Lambda\), within a pipeline structure \(g'\). The meta-dataset M thus records \(P_{j,i}\) as real-valued targets, with the categorical and numerical hyperparameters \(\lambda _{j,i} \in \Lambda\), \(\lambda _{j,i} \in \mathbb {R}^{k}\), serving as the features for task \(t_{j}\).

Utilizing a meta-learner \(\psi _{L}\) (\(\psi _{L}:\Lambda \mapsto \mathbb {R}\)), we project performance indicators \((\mu _{j,i}, \sigma _{j,i})\) for these configurations. Furthermore, by segregating M into distinct sets for training (\(M^{\text {train}}\)) and evaluation (\(M^{\text {eval}}\)), we employ an acquisition function \(a_{\mathcal {M}l}\) to identify an initial configuration \(\lambda _{w}\) that strikes a balance between predicted efficacy and potential for further exploration, formalized as follows:

$$\begin{aligned} \lambda _{w} \in {\mathop {\mathrm{arg\,max}}\limits _{\lambda _{w}, m(j',i') \in M^{eval}}}\, \sum _{i=1}^{K} \frac{1}{K} a_{\mathcal {M}l} \left( \psi _{L, M^{train}} \left( m \left( j',i' \right) \right) \right) \end{aligned}$$
(4)
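A hedged sketch of this selection step follows: a RandomForest regressor stands in for \(\psi _{L}\), the spread of its per-tree predictions provides the \((\mu , \sigma )\) estimates, and Expected Improvement plays the role of \(a_{\mathcal {M}l}\). The meta-dataset here is synthetic, and the encoding of configurations into fixed-length feature vectors is an assumption for illustration.

```python
# Sketch of the warm-start selection in Eq. (4): a RandomForest meta-learner
# psi_L is fit on M_train (features = encoded configurations lambda, target
# = performance P), then Expected Improvement is evaluated over M_eval to
# pick the initial configuration lambda_w. All data here is synthetic.
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.random((300, 10))     # encoded configurations in M_train
y_train = rng.random(300)           # scalar performances P_{j,i}
X_eval = rng.random((50, 10))       # candidate configurations in M_eval

psi_L = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Per-tree predictions give a cheap mean/uncertainty estimate (mu, sigma).
tree_preds = np.stack([t.predict(X_eval) for t in psi_L.estimators_])
mu, sigma = tree_preds.mean(axis=0), tree_preds.std(axis=0) + 1e-9

best = y_train.max()                # best performance seen so far
z = (mu - best) / sigma
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # Expected Improvement

lambda_w = X_eval[np.argmax(ei)]    # warm-start configuration
print("index of lambda_w:", int(np.argmax(ei)))
```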

The ultimate goal in a warm-started Bayesian Optimization (BO) framework is to ascertain the optimal machine learning pipeline \(\mathcal {P}_{g', \lambda ^{*}}\). This pipeline is composed of a combination of pre-trained and classical models denoted by \(\lambda ^{*}\), configured within a valid structural form \(g'\), where \(g' \in G\). The optimization seeks to minimize the empirical risk \(\hat{\mathcal {R}}\) over the pipeline configurations and structure:

$$\begin{aligned} \left( g', \lambda ^{*} \right) ^{*} \in {\mathop {\mathrm{arg\,min}}\limits _{\lambda ^{*} \in \Lambda , g' \in G}}\, \frac{1}{k} \sum _{i=1}^{k} \hat{\mathcal {R}} \left( \mathcal {P}_{g',\lambda ^{*},D^{(i)}_{train}}, D^{(i)}_{valid} \right) \end{aligned}$$
(5)
Fig. 2 Diagrammatic representation of the meta-learner \(\psi _{L}\) facilitated process for selecting the initial configuration \(\lambda _{w}\) to initiate the BO search

This is complemented by the meta-learning strategy, where a meta-learner \(\psi _{L}\) forecasts the mean and standard deviation of the performance for given configurations. The acquisition function is subsequently optimized across the evaluation set to select an initial configuration \(\lambda _{w}\) that maximizes the expected performance. This initial configuration \(\lambda _{w}\) is then employed to commence the BO search procedure. Figure 2 provides an abstract overview of the formulated process.

3.3 Configuration space (\(\Theta\)) complexity

The Configuration Space, denoted as \(\Theta\), is delineated as a complex hybrid space composed of learning algorithms \(\mathcal {A}\), the hyperparameter space \(\Lambda\), and the pipeline structures \(G\). Elements within \(\Theta\) are categorized as either numerical or categorical. The diversity, \(D_{\text {numerical}}\), across \(n\) numerical hyperparameters in \(\Theta\), where \(H_{i,\text {max}}\) and \(H_{i,\text {min}}\) signify the maximum and minimum allowable values for the \(i\)th numerical parameter, is defined as:

$$\begin{aligned} D_{\text {numerical}} = \prod _{i=1}^{n}(H_{i,\text {max}} - H_{i,\text {min}}) \end{aligned}$$
(6)

For hyperparameters scaled logarithmically (e.g., weight decay and layer normalization \(\epsilon\)), the term \((H_{i,\text {max}} - H_{i,\text {min}})\) is substituted by \(\log (\frac{H_{i,\text {max}}}{H_{i, \text {min}}})\). Conversely, the diversity for \(m\) categorical hyperparameters, \(D_{\text {categorical}}\), within \(\Theta\) is the product of available categories for each hyperparameter:

$$\begin{aligned} D_{\text {categorical}} = \prod _{j=1}^{m}C_{j} \end{aligned}$$
(7)

where \(C_{j}\) represents the number of categories for the \(j\)th categorical hyperparameter. The overall complexity of \(\Theta\) is thus a function of both numerical and categorical diversity, alongside the aggregate possible configurations (\(T_{c}\)), formulated as:

$$\begin{aligned} T_{c} = D_{\text {numerical}} \times D_{\text {categorical}} \end{aligned}$$
(8)

Incorporating values from Table 2:

  • For categorical hyperparameters: Pretraining Model (8 options), Pretraining Processors (3 options), Downstream model (12 options), and Downstream processor (1 option).

  • For numerical hyperparameters: Totaling 8, with their respective ranges considered for computation.

  • The \(\Theta\) space encompasses two subsets related to pre-trained models \(\Theta _{ptm}\) and Neural Architecture Search (NAS) algorithms \(\Theta _{NAS}\).

Hence, the complexity \(T_{c}\) is determined as:

$$\begin{aligned} |T_{c}| = T_{c,\Theta _{ptm}} + T_{c,\Theta _{NAS}} = 9.38 \times 10^{14} \end{aligned}$$
(9)

with \(T_{c,\Theta _{ptm}}\) and \(T_{c,\Theta _{NAS}}\) indicating the complexities of the spaces comprising hyperparameters associated with pre-trained models and NAS algorithms, respectively. Consequently, the estimated complexity of our constructed configuration space \(\Theta\) approximates 938 trillion possible configurations, emphasizing the computational challenge of optimizing within this extensive space.
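The back-of-the-envelope computation below reproduces Eqs. (6)-(8). The categorical counts come from the bullets above; the numerical ranges are illustrative placeholders, since the exact bounds live in Table 2.

```python
# Illustrative reproduction of Eqs. (6)-(8) for the configuration space
# complexity. Numerical ranges below are stand-ins; the paper's exact
# values live in Table 2.
import math

# Categorical diversity (Eq. 7): counts taken from the bullets above
# (pretraining model, pretraining processors, downstream model/processor).
categorical_counts = [8, 3, 12, 1]
D_categorical = math.prod(categorical_counts)

# Numerical diversity (Eq. 6): (max - min) per linear-scale hyperparameter,
# log(max / min) for log-scale ones such as weight decay and layer-norm eps.
linear_ranges = [(0.0, 0.5), (1, 512)]        # e.g. dropout, batch size
log_ranges = [(1e-6, 1e-2), (1e-12, 1e-5)]    # e.g. weight decay, eps
D_numerical = math.prod(hi - lo for lo, hi in linear_ranges)
D_numerical *= math.prod(math.log(hi / lo) for lo, hi in log_ranges)

T_c = D_numerical * D_categorical             # Eq. (8)
print(f"D_categorical={D_categorical}, D_numerical={D_numerical:.3g}, T_c={T_c:.3g}")
```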

4 Pretrained transformer-based AutoML (PTA) methodology

Fig. 3 Detailed overview of the Pretrained Transformer-based AutoML (PTA) system for warm-starting AutoML over multimodal data. LF denotes Late-Fusion

4.1 Overall methodology

Our research integrates pre-trained deep neural models into Automated Machine Learning (AutoML) systems to efficiently process multimodal data. Our methodology comprises four sequential steps: Prior Evaluation and Meta-Dataset Construction; Configuration Space (\(\Theta\)) Construction; SMAC Setup and Execution; and Evaluating the Optimisation. The rest of this section explains each of these steps in detail. Figure 3 describes the overall workflow of our proposed Pre-Trained Transformer Based AutoML (PTA) framework. The meta-dataset M is constructed through extensive evaluations, conducted prior to the SMAC optimisation, across diverse pipeline configurations, tasks, and datasets following some pipeline structure \(g'\). Moreover, \(\Theta\), represented in the diagram, is the search space constructed after building the meta-dataset M. The encoder, meta-model (\(\psi _{L}\)), acquisition function, decoder, objective function, intensifier, and configuration selector are all components within SMAC. We examine the interactions of these components in detail in Sect. 4.4.

4.2 Prior evaluations and meta-dataset construction

4.2.1 Prior evaluations: pipeline variants of multimodal AutoML

In this step, we evaluate task-specific variants of multimodal pipeline architectures constructed in a specific pipeline structure \(g'\) across datasets belonging to the tabular-text, text-vision and tabular-text-vision modalities. Furthermore, we evaluate these pipeline variants across tasks such as classification, regression, Image Text Matching (ITM), and Visual Question Answering (VQA). Our framework includes three specific pipeline variants, each designed for a specific modality-task combination.


Pipeline Variant 1

  • Tabular-Text Modality This variant, focusing on classification and regression tasks, utilizes FLAVA and Data2Vec models for text data processing and a NAS-derived MLP (multimodal-net) for tabular data. The late fusion strategy combines these embeddings, which are then processed by AutoGluon’s Tabular Predictor. The pipeline architecture, as shown in Fig. 4a, is optimized for handling both binary (multi-label) classification and regression tasks. The AutoGluon Tabular Predictor explores various tabular architectures, including ensemble tree-based models, and records model performance to inform future meta-learning.

  • Tabular-Text-Vision Modality For this modality, the pipeline employs FLAVA and Albef models to encode image-text data, and the multimodal-net for tabular data. The encoded data from all three modalities is mapped into a unified latent space, implementing late fusion and downsampling using translation invariant methods like MaxPool. The combined embeddings are then processed by AutoGluon’s Tabular Predictor for task-specific execution, as depicted in Fig. 4b. This variant aims to record pipeline performance over multimodal datasets, enriching the meta-dataset for future optimization strategies.
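A minimal late-fusion sketch in PyTorch is given below: per-modality embeddings are projected to a shared latent width, stacked, and downsampled with a translation-invariant max-pool across modalities before the fused vector is handed to a downstream (tabular) predictor. The embedding dimensions and projection layers are illustrative assumptions, not the pipeline's actual sizes.

```python
# Minimal late-fusion sketch: each modality is encoded independently,
# projected to a shared latent width, then combined by a max-pool across
# modalities (a translation-invariant downsampling). Dimensions are
# illustrative, not the pipeline's actual sizes.
import torch
import torch.nn as nn

latent = 256
proj_text = nn.Linear(768, latent)    # e.g. FLAVA text embedding -> latent
proj_img = nn.Linear(1024, latent)    # e.g. Albef image embedding -> latent
proj_tab = nn.Linear(64, latent)      # e.g. multimodal-net tabular features

e_text = torch.randn(8, 768)          # batch of 8 text embeddings
e_img = torch.randn(8, 1024)
e_tab = torch.randn(8, 64)

# Late fusion: stack the projected embeddings along a "modality" axis.
stacked = torch.stack(
    [proj_text(e_text), proj_img(e_img), proj_tab(e_tab)], dim=1
)                                      # (batch, 3 modalities, latent)
fused, _ = stacked.max(dim=1)          # MaxPool across modalities -> (batch, latent)
print(fused.shape)                     # torch.Size([8, 256])
```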

Fig. 4 Architectural Designs of the Pipeline Variant 1 for the Tabular-Text and Tabular-Text-Vision Modalities


Pipeline Variant 2

Focused on the ITM task, this variant processes unlabelled image-text data using FLAVA and Albef models. Data from datasets like Flickr30k and SBU image captioning is batch-processed and encoded into unified latent embeddings. A regression task maps these embeddings to contrastive scores obtained from the pre-trained models, optimizing the mapping using AutoGluon Tabular, as shown in Fig. 5a. This approach aims to assess model performance and optimize pipeline configurations for the ITM task.


Pipeline Variant 3

Designed for the VQA task (Fig. 5b), this variant focuses on labeled vision-language data from the VQA2.0 dataset. FLAVA and Albef models are used for encoding images and questions. Batch processing is conducted via a PyTorch class object, with the encoded data forming a unified latent space. The combined embeddings and answer targets create a new dataset, processed by AutoGluon Tabular for the VQA task. The focus is on deriving insights from the visual-textual interplay and optimizing model selection for the VQA task. We record the performance evaluations (AUC scores) of the multi-label classification task conducted by AutoGluon’s Tabular Predictor using different tree-based ensemble models. This prior knowledge is incorporated into the meta-dataset M.

Fig. 5 Architectural Designs of the Pipeline Variants 2 and 3, respectively

Each variant demonstrates a unique approach to multimodal data processing, leveraging the capabilities of pre-trained models and the exploratory power of AutoGluon Tabular.

4.2.2 Meta-dataset construction

After designing the three pipeline variants above, a meta-dataset M is constructed by recording the scalar performances \(P_{j,i}\) corresponding to each variant, across various tasks \(t_{j}\) selected from the set comprising classification, regression, ITM and VQA tasks. Given the ith pipeline configuration \(\lambda _{j,i}\) (pre-processing, pre-trained, and traditional ML algorithm names as well as hyperparameters) for the jth task, M records \(\lambda _{j,i}\) and the corresponding \(P_{j,i}\). M is realized as a nested Python dictionary, whose keys are the hyperparameter or algorithm names and whose values are the recorded experimental values (numerical or categorical). M also records the names of the pre-trained models along with the names of the traditional ML models used in the pipeline as strings. The list of hyperparameters corresponding to the different pre-trained and traditional ML models included within M can be found in Table 2.
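The snippet below shows the assumed shape of one such entry of M; all keys, names, and values are hypothetical examples of the kind of record described above, not actual experimental results.

```python
# Illustrative shape of one meta-dataset entry; keys and values are
# hypothetical examples of what is recorded, not actual results.
M = {
    "task_petfinder_classification": {
        "config_0": {
            "pretrained_model": "flava",        # A_ptm, stored as a string
            "downstream_model": "LightGBM",     # A_m, stored as a string
            "ptm_layer_norm_eps": 1e-12,        # numerical hyperparameters
            "ptm_hidden_dropout": 0.1,
            "downstream_num_leaves": 64,
            "performance": 0.74,                # scalar target P_{j,i}
        },
    },
}

# Flattening M yields the (lambda_{j,i} -> P_{j,i}) pairs on which the
# meta-learner psi_L is trained.
for task, runs in M.items():
    for run_id, record in runs.items():
        features = {k: v for k, v in record.items() if k != "performance"}
        print(task, run_id, features, "->", record["performance"])
```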

4.3 Construction of configuration space \(\Theta\)

The construction of the configuration space \(\Theta\) is crucial for synthesizing effective machine-learning pipelines in our AutoML system. \(\Theta\) serves as a search space for the Sequential Model-Based Optimization (SMBO) algorithm, containing various components such as pre-trained models, feature processors, and classical ML models, structured hierarchically. This hybrid space includes both the categorical and numerical hyperparameters described in Table 2. The inclusion of pre-trained models like FLAVA, Albef, and Data2Vec in \(\Theta\) is motivated by their distinct capabilities and performance metrics (Du et al., 2022; Khan et al., 2021). FLAVA, for instance, excels in various multimodal tasks and language benchmarks, making it a valuable inclusion for its broad applicability. Data2Vec’s modality-agnostic nature is pivotal for generating universal representations, while Albef adds diversity with its specific strengths. In addition to these models, \(\Theta\) encompasses conventional tree-based models, classical ML classification and regression models, and various preprocessing algorithms. This configuration space is designed as a hybrid of categorical choices (such as selecting specific pre-trained models) and numerical hyperparameters (like layer normalization epsilon and dropout probabilities). The space is conditioned on the task type (\(\lambda _{r}\)), with hyperparameters rendered inactive when irrelevant to the task at hand, ensuring an efficient and targeted search during optimization (refer to Sects. 2.3 and 3.1).

Table 2 Selected hyperparameters and their corresponding ranges in \(\Theta\)

4.4 SMAC setup and execution: warm starting PTA

Upon the assembly of the hybrid configuration space \(\Theta\), we initialise and warm-start the Sequential Model-Based Algorithm Configuration (SMAC) procedure. This process encompasses the meta-dataset M, meta-model \(\psi _{L}\), acquisition function \(a_{\mathcal {M}l}\), objective function \(f_{\theta }\), intensifier \(\mathcal {I}\), and the configuration selector, aimed at addressing the Combined Algorithm Selection and Hyperparameter Optimization (CASH) challenge. The meta-dataset M, compiled during the preceding evaluation phase, encapsulates data on the pipeline configurations \(\lambda\) alongside their scalar performance evaluations P.

Following the creation of meta-dataset M, we establish the Scenario \(\mathcal {S}\) for optimisation, which delineates the optimisation landscape, specifying iterations, budget, and the exploration bounds within \(\Theta\). The intricacies of our optimisation scenario \(\mathcal {S}\) are elaborated in the experiments and results section. After the configuration of \(\mathcal {S}\), the meta-learner \(\psi _{L}\) is trained on M (Footnote 1) using 3-fold cross-validation. With the modality and the task defined as a root-level hyperparameter \(\lambda _{r}\), the configuration selector samples only the hyperparameters activated under the corresponding subset of the structured configuration space. Furthermore, to ascertain the validity of the pipeline, conditional logic checks whether the sampled pre-trained model (itself a hyperparameter) lies within the permitted zoo of models for the given input task and modality. With these checks, we prevent incompatibility and negative transfer. The configuration selector begins the optimization process by identifying a set of n random initial samples confined within the hierarchy and boundaries of our defined configuration space. The performance of these n configurations is predicted using the trained meta-learner \(\psi _{L}\). The RandomForest meta-learner \(\psi _{L}\) performs a regression task, mapping hyperparameter configurations \(\lambda \in \mathbb {R}^{k}\) to a real value, i.e., \(\psi _{L}: \Lambda \mapsto \mathbb {R}\). Using the mean predictive performance and uncertainty estimates derived from \(\psi _{L}\), we maximize the Expected Improvement (EI) acquisition function \(a_{\mathcal {M}l}\) to pinpoint the initial configuration with the highest potential for optimal performance. The selected initial configuration \(\lambda _{w}\), with the highest EI score, is extracted from the configuration space for actual assessment by the objective function. This function, \(f_{\theta }\), maps the input features \(X \in \mathbb {R}^{d}\) to a real-valued performance metric p, expressed as \(f_{\theta }: \mathbb {R}^{d} \mapsto \mathbb {R}\). At this point, the pipeline integrating the chosen configurations is applied to the input data to fulfil the designated task objective. \(f_{\theta }\) appraises the pipeline’s efficacy for the specific configuration \(\lambda\), including hyperparameters and models, converting these into a scalar metric such as AUC for classification or \(R^2\) for regression tasks. Designed to accommodate multimodal data, \(f_{\theta }\) adjusts its evaluations based on the data modality and the precise task, engaging diverse pre-trained models like FLAVA, Albef, and Data2Vec to ensure its assessments are both accurate and pertinent. Consequently, \(f_{\theta }\) evaluates \(\lambda _{w}\), updating the performance metric in M; \(\lambda _{w}\) is thus acknowledged as the initial incumbent configuration.

To select the subsequent configuration \(\lambda _{w+1}\), \(\psi _{L}\) undergoes re-training with the refreshed M. The configuration selector then extracts m random configurations, applying a 10–12% perturbation rate around the incumbent configuration (\(\lambda _{w}\)), facilitating the exploration of well-performing setups. The configuration with the maximal expected improvement is earmarked for actual evaluation through \(f_{\theta }\). Should this configuration yield a performance metric surpassing that of the incumbent, it is incorporated into the intensifier queue, updating the incumbent and M concurrently. The intensifier \(\mathcal {I}\) adopts an Aggressive Racing Strategy, probing the vicinities of promising incumbent configurations by instructing the configuration object to sample m configurations with the specified perturbation rate around these incumbents. The evaluations of these sampled configurations proceed in parallel, with the evaluation records continually refreshed in M, and \(\psi _{L}\) re-trained prior to each sampling iteration. This cycle repeats until the optimisation budget is exhausted. Figure 3 furnishes a graphical representation of this warm-started SMAC optimisation procedure.
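The sketch below assembles these pieces with SMAC3's 2.x API (assumed here): the warm-start configuration \(\lambda _{w}\) from Eq. (4) is injected through the initial design so that BO starts from a promising region. The toy objective `evaluate_pipeline` stands in for \(f_{\theta }\), and the `additional_configs` mechanism is one way to seed the search; the paper's own warm-starting wiring may differ.

```python
# Hedged sketch of warm-starting SMAC (SMAC3 2.x API assumed): lambda_w is
# injected as an additional initial configuration. `evaluate_pipeline` is a
# toy stand-in for the objective f_theta, not the paper's implementation.
from ConfigSpace import Configuration, ConfigurationSpace, Float
from smac import HyperparameterOptimizationFacade, Scenario

cs = ConfigurationSpace(seed=0)
cs.add_hyperparameters([Float("ptm_dropout", (0.0, 0.5), default=0.1)])

def evaluate_pipeline(config: Configuration, seed: int = 0) -> float:
    # f_theta: build the pipeline for `config`, run it, and return a loss
    # (SMAC minimizes). Here: a toy quadratic standing in for 1 - AUC.
    return (config["ptm_dropout"] - 0.21) ** 2

scenario = Scenario(cs, n_trials=50, walltime_limit=45 * 60)  # 45-min budget

lambda_w = Configuration(cs, values={"ptm_dropout": 0.2})     # from Eq. (4)
initial_design = HyperparameterOptimizationFacade.get_initial_design(
    scenario, n_configs=0, additional_configs=[lambda_w]      # warm start
)

smac = HyperparameterOptimizationFacade(
    scenario, evaluate_pipeline, initial_design=initial_design
)
incumbent = smac.optimize()
print(incumbent)
```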

4.5 Evaluating the optimization: any-time learning metric

In our AutoML framework, evaluating the optimization process post-convergence of the SMAC loop is crucial. To achieve this, we adopt the 'any-time' learning metric, taking inspiration from Liu et al. (2021), emphasizing efficiency under time and data constraints. This is quantified through the Area Under Learning Curve (ALC), expressed as:

$$\begin{aligned} ALC = \frac{1}{\log \left( 1 + \frac{T}{t_{0}} \right) } \int _{0}^{T} \frac{s(t)}{t + t_{0}}dt \end{aligned}$$
(10)

where s(t) is the scoring function. For classification tasks, we use the Normalised Area Under the Curve (NAUC; Footnote 2) as the scoring function; for regression tasks, the \(R^{2}\) score of the model acts as the scoring function in the above equation. The ALC metric captures the learning trajectory over time, especially in the initial phases, reflecting the system’s rapid learning efficiency. By computing the ALC for each dataset, we comprehensively evaluate the AutoML system’s optimization process, focusing on its ability to adapt and learn effectively within limited time frames. This methodological approach thoroughly assesses the framework’s learning behavior and operational efficiency in resource-constrained environments. Additionally, the hierarchical nature of the configuration space ensures the activation of only task-related hyperparameters, rendering task-unrelated hyperparameters, including pre-trained models and their corresponding hyperparameters, inactive. Given this hierarchy, our evaluation function evaluates only scenarios where the pipeline components are entirely compatible with the learning task, given a task and a dataset. The configuration sampler strives to avoid incompatibility, and any hypothetical occurrence would result in consistent crashing and low ALC values. A high ALC value indicates consistent performance and effective sampling of learning algorithms by the AutoML framework, suggesting that the chosen configurations are generally well-suited to the tasks. In case a trial (sampled pipeline configuration) fails to complete its training within the trial budget, we record the performance of such a sample as \(-inf\) or CRASH. Incompatible samples are handled similarly.
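For concreteness, the following is a direct numerical transcription of Eq. (10) from a recorded learning curve. The curve values are made up for illustration, and crashed trials are assumed to contribute \(s(t) = 0\) so the integral stays finite.

```python
# Numerical transcription of the ALC metric in Eq. (10), computed from a
# recorded learning curve. The step-function curve below is illustrative;
# crashed trials contribute s(t) = 0 rather than -inf here.
import numpy as np

def alc(times: np.ndarray, scores: np.ndarray, T: float, t0: float = 60.0) -> float:
    """Area under the learning curve with log-time normalization, Eq. (10)."""
    grid = np.linspace(0.0, T, 10_000)
    # s(t): best score achieved up to time t (step function, 0 before the
    # first successful evaluation).
    s = np.zeros_like(grid)
    running_best = np.maximum.accumulate(scores)
    for t_i, s_i in zip(times, running_best):
        s[grid >= t_i] = s_i
    integrand = s / (grid + t0)
    return np.trapz(integrand, grid) / np.log(1.0 + T / t0)

times = np.array([120.0, 400.0, 900.0, 1800.0])   # seconds within the 45-min run
scores = np.array([0.55, 0.68, 0.66, 0.74])       # NAUC of each evaluated trial
print(f"ALC = {alc(times, scores, T=45 * 60):.3f}")
```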

5 Experiments and results

5.1 Experimental settings


Experiment 1: Prior evaluations of multimodal pipeline architectures

This experiment aims to collect prior insights by evaluating various pipeline configurations across 23 multimodal datasets, documenting the configurations \(\lambda _{j,i}\) and their performance estimates \(P_{j,i}\) in a meta-dataset M. We explore three variants of our proposed multimodal pipeline architectures that employ Late Fusion to integrate multiple modalities for tasks including classification, regression, image-text matching (ITM), and visual question answering (VQA). The experimental setup involves selecting a pre-trained multimodal vision-language transformer to represent vision-language data within a unified latent space. Additionally, we employ Neural Architecture Search (NAS) via AutoGluon to construct a dynamic multilayer perceptron (MLP) for tabular data, aligning it within the same latent space. The Late Fusion process linearizes the final embeddings, which are then utilized by AutoGluon Tabular to predict the respective targets.

For classification and regression tasks, we utilize the 18 AutoMM Benchmark Datasets introduced by Shi et al. (2021). To evaluate classification tasks we use the NAUC metric, and for regression tasks we study the \(R^{2}\) score. The ITM task evaluations are conducted using the Flickr30k (Plummer et al., 2015) and SBU Image Captioning Dataset (Ordonez et al., 2011), while VQA task performance is assessed on the VQA 2.0 Dataset (Agrawal et al., 2015). Additionally, the PetFinder and CD-18 Datasets are employed to evaluate classification and regression tasks within the tabular-text-vision modality. The first variant of our pipeline architecture, as depicted in Fig. 4a, b, is utilized for evaluating tabular-text and tabular-text-vision modalities in classification and regression tasks, respectively. The second pipeline variant, shown in Fig. 5a, is used for ITM tasks, while the third variant addresses the VQA task. Each evaluation session spans \(\approx\) 5–6 h to ensure optimal or near-optimal pipeline configurations are achieved. It is important to note that the pre-trained models’ weights are kept frozen, with only the downstream ML models’ weights being fine-tuned. This implies that the variation in performance observed during the optimization process arises solely from adjustments in the hyperparameters (of the pre-trained models) rather than any changes in the models’ weights. Freezing the pre-trained weights enables us to study the variation in model performance both when the pre-trained models are kept the same and only the hyperparameters are varied, and when we vary the pre-trained models along with their hyperparameters for a given task, modality, and dataset.


Experiment 2: Assessing the efficacy of warm-started PTA

This experiment evaluates the SMAC optimization curve across varied pipeline variants for multimodal tasks, using the ALC metric, which relies on a scoring function \(s(t)\) (Footnote 3) for configuration performance at time \(t\). We focus on assessing our warm-started PTA framework’s optimization quality by studying the ALC and \(s(t)\) scores through the learning curves, identifying promising configurations for further analysis within a 45-min limit, as suggested by Liu et al. (2021). The Scenario \(\mathcal {S}\) is set up with trials capped at 20 min and overall optimization limited to 45 min. With this specific budget, we aim to highlight the efficiency of our warm-started PTA framework in sampling optimal incumbent candidates within a vast search space under a limited budget. Unsuccessful trials are marked as -inf or CRASH, with successful ones \(\gamma \subseteq \Gamma\) being evaluated for consistent exploration of high-performing candidates, represented by high ALC scores (\(\approx 1\)). We address the challenge of inappropriately chosen pre-trained models by including their evaluation in the ALC and marking mismatches or negative transfers with a performance score of -inf, thereby avoiding their impact on the system’s learning curve. This approach ensures that frequent incompatibilities lead to fewer evaluations and lower ALC scores, indicating areas for potential improvement. We then analyze the learning curves across 23 multimodal datasets, plotting the scoring function \(s(t)\) (NAUC for classification and \(R^{2}\) for regression) against log-scaled time (Fig. 6a, b), and compute the Area Under the Learning Curve (ALC) as an any-time learning metric to evaluate the consistency of optimization efforts for all 23 multimodal datasets.

Moreover, we compare the efficacy of our warm-started PTA, leveraging pre-trained models, against an inherently cold-started NAS (multimodal-net) method implemented by Autogluon for handling multiple modalities. The main reason for this comparison is to showcase the efficient exploration of a complex search space under budget constraints by a hybrid pipeline approach (pre-trained transformers + NAS) infused with prior knowledge, as compared to a computationally expensive, cold-started NAS method. Furthermore, we implement Late Fusion (LF) for integrating modalities as suggested by Shi et al. (2021) and Liang et al. (2021), the method performing best across 18 real-world datasets as showcased by Shi et al. (2021) in their experiments on Autogluon. Finally, since the only pre-trained multimodal model implemented by Autogluon is CLIP, apart from its NAS methods for handling multimodal data, we intend to compare Autogluon’s performance against a hybrid AutoML architecture incorporating more sophisticated multimodal models.

For the tabular-text modality, we benchmark against the AutoMM Benchmark by Shi et al. (2021), using a cold-started NAS-based multimodal-net over 5–6 h. Absent a public benchmark for vision-language tasks, we assess the warm-started PTA’s performance (under a 45-min budget constraint) against Autogluon’s NAS-based multimodal-net, evaluating incumbent scores on a similar budget of 45 min. With this comparison, we aim to show that reducing the dependency on NAS for multimodal processing can significantly improve the efficiency of pipeline generation under a constrained budget.

5.1.1 Results: experiment 1

In this section, we report the observed average metrics for the datasets and tasks selected for prior evaluations. For the AutoMM Benchmark datasets, we report the average AUC and \(R^{2}\) scores (\(\mu\) in Table 3) obtained by the FLAVA and Data2Vec pipeline variants across the classification and regression datasets from the AutoMM benchmark, respectively. For the Flickr30k, SBU, VQA, PetFinder and CD-18 datasets, we report the average scores (\(\mu\)) obtained across different pipeline hyperparameters and downstream traditional ML models fitted through Autogluon Tabular, after fixing a pre-trained multimodal model within the multimodal pipeline.

Table 3 FLAVA, Albef, and Data2Vec prior evaluation results across 23 Datasets (18 AutoMM, Flickr30k, SBU, PetFinder, and CD-18) for the 3 modalities, over classification, regression, ITM, and VQA tasks

Tabular + Text modality

FLAVA demonstrates an average AUC score of 0.354 across the classification datasets and an average \(R^{2}\) score of 0.415 across the regression datasets, with a minimal variance of 0.001 and 0.003 respectively, indicating consistent handling of tabular and text data. This performance is 23% and 55% higher than Data2Vec’s average AUC and \(R^{2}\) scores of 0.272 and 0.236 respectively. However, it’s important to note that FLAVA requires a higher average prediction time across both classification and regression datasets compared to Data2Vec. Despite this longer prediction time, FLAVA’s scores are not significantly higher than the ones obtained on the relatively faster Data2Vec pipelines.


Text + Vision modality

For the Visual Question Answering task over the VQA2.0 dataset, Albef and FLAVA give approximately similar performances, with AUC scores of 0.933 and 0.931 respectively, and Albef pipelines performing slightly better on average across different downstream models. The observed difference in average prediction and fit times across Albef and FLAVA pipelines is relatively small, with both pipeline configurations showing consistent performances (\(\sigma ^{2} \approx 0\)) across different downstream ML models. In the ITM task, Albef slightly outperforms FLAVA on the Flickr30k (0.216 vs. 0.205) and SBU datasets (0.181 vs. 0.177). Moreover, Albef pipelines are considerably more efficient than the FLAVA pipelines for prediction on ITM tasks. The average prediction time on the VQA dataset is approximately the same for the pipeline configurations consisting of FLAVA and Albef models.


Tabular + Text + Vision modality

Albef’s performance in the complex multimodal scenarios of the PetFinder and CD-18 datasets (0.393 and 0.424, respectively) edges out FLAVA’s scores (0.373 and 0.412), improvements of roughly 5% and 3%, respectively. Albef also shines in computational efficiency, particularly on the CD-18 dataset, where its average prediction time is 178.171 s, nearly 20% faster than FLAVA’s 215.790 s.


Stability Across Models

The stable performance of FLAVA and Albef, as indicated by low variance scores \(\sigma ^{2}\), highlights their reliability in AutoML strategies. Despite Data2Vec’s modest performance, its modality-agnostic design stands out. Our evaluation compares pipeline performances with pre-trained models such as FLAVA, Albef, and Data2Vec, underpinning the belief that high-quality, higher-dimensional representations are crucial for predictive accuracy. These representations are vital for the success of downstream models, with the optimization and quality of embeddings significantly impacting performance outcomes. The performances of FLAVA, Albef, and Data2Vec across the various modalities offer insights into their distinct strengths, informing future AutoML model selection and strategy enhancements.

5.1.2 Results: experiment 2

Fig. 6 Learning curves depicting the obtained \(R^{2}\) scores for the Flickr30k, SBU (6a) and CD-18 datasets (6b) and the NAUC scores for the PetFinder dataset (6b), as a function of time across different pipeline configurations evaluated during the SMAC optimisation process for the ITM and VQA tasks respectively. x-axis: log(t), y-axis: s(t)

Fig. 7 ALC scores obtained using our warm-started PTA for the 3 studied modalities under a budget T (45 mins). x-axis: datasets, y-axis: ALC scores. Blue: classification data, orange: regression data

Figures 7a–c illustrate the ALC values for the 23 multimodal datasets, highlighting the efficiency and efficacy of our warm-started PTA framework through high ALC scores of 0.762, 0.803, 0.815, and 0.693 for the Petfinder, VQA2.0, Flickr30k, and SBU datasets, respectively, across the tabular-text-vision and text-vision modalities. The learning curves in Fig. 6a, b, representing the Flickr30k, SBU, Petfinder, and CD-18 datasets, corroborate the ALC values, with scores nearing 1 signifying the generation of incumbents with optimal s(t) values throughout the 45-min optimization period.

Fig. 8 Comparison of the NAUC/\(R^{2}\) scores obtained using our PTA framework with AutoGluon for the Tabular-Text, Text-Vision and Tabular-Text-Vision modalities respectively (y-axis: 0 to 1). x-axis: datasets, y-axis: NAUC or \(R^{2}\) (s(t)) scores. Blue: classification data, orange: regression data

For the tabular-text classification task, the salary dataset achieved the highest ALC score of 0.819 (Fig. 7c). The average ALC score for the classification datasets within the AutoMM benchmark is 0.711, as shown in Fig. 7c, indicating substantial performance despite budget limitations. Additionally, an average ALC score of 0.702 was noted for the regression datasets (Fig. 7c), demonstrating our framework’s capability to produce effective pipeline configurations under time constraints. Comparing the incumbents synthesized by our warm-started PTA framework against those from a cold-started NAS method over a similar budget of 45 min, Fig. 8 contrasts the s(t) scores of both approaches for selected datasets. The PTA incumbents for tabular-text classification datasets averaged higher performance (0.7521) compared to the NAS-based multimodal-net by Autogluon (0.530) (Fig. 8a, b), and for tabular-text regression tasks, PTA incumbents also outperformed (0.695 \(R^{2}\)) the NAS-based approach (0.196 \(R^{2}\), Fig. 8a, b). Specifically, for the VQA2.0 dataset, the warm-started PTA framework achieved a NAUC score of approximately 0.820 (Fig. 8c), surpassing the NAS-based approach’s score of approximately 0.433 (Fig. 8d). Similarly, ITM tasks on the Flickr30k and SBU datasets resulted in incumbent \(R^{2}\) scores of 0.912 and 0.810 (Fig. 8c), respectively, demonstrating superior performance over the Autogluon MLP (Fig. 8d). For tabular-text-vision modalities, such as the Petfinder and CD-18 datasets (Fig. 8e, f), the performance of PTA incumbents closely matched the NAS-based incumbents. Based on these results and the additional results in the Appendix (Figs. 9, 10), we infer that it takes Autogluon a considerable amount of resources, both computational and temporal, to construct a complex neural architecture from scratch, especially for non-tabular modalities. The high ALC values obtained by our approach across the 23 datasets underscore that utilising NAS only for the tabular modality, while leveraging prior experience for an informed search across a complex search space, consistently generates well-performing pipeline configurations under a constrained computational budget. Moreover, the combined cross- and intra-modal interactions facilitated by our pre-trained models preserve and capture complex non-linear interactions (\(\hat{e}\)), which turn out to be more informative than the representations constructed by Autogluon’s multimodal-net.

Fig. 9 Warm-Started SMAC Results for the AutoMM Benchmark (Shi et al., 2021)

Fig. 10 Warm-Started SMAC Results for the AutoMM Benchmark (Shi et al., 2021)

6 Conclusion and discussion

This study advances the Automated Machine Learning (AutoML) field, emphasizing multimodal data processing across visual, textual, and tabular inputs. By incorporating pre-trained models, meta-learning, and optimization strategies, we have explored innovative approaches for complex pipeline configurations. Our experiments demonstrate the framework’s rapid convergence to optimal configurations across various modalities, highlighted by its performance on text-vision tasks using datasets like Flickr30k and SBU Image Captioning. The high NAUC scores and Area Under the Learning Curve (ALC) scores across 23 datasets attest to our framework’s efficiency in consistently crafting effective multimodal pipeline architectures within computational constraints. Our comparisons with traditional NAS methods reveal our framework’s superior efficiency, especially in time-limited scenarios, showcasing the strength of warm-starting and partial dependence on NAS, along with potential areas for improvement. The framework’s success in resource-limited settings indicates its potential applicability in real-world scenarios when searching for well-performing architectures, laying a foundation for further research and highlighting the need for broader testing and validation in diverse environments.

Turning to the limitations of our work: in our AutoML framework, we utilize a warm-start approach in which pre-trained models are incorporated with their weights frozen. This ensures that performance variations during optimization arise solely from hyperparameter adjustments, not from changes in the (pre-trained) model weights. Our study focuses on how these hyperparameters impact the efficacy of a static, pre-trained model architecture. We do not fine-tune the pre-trained model weights; instead, we assess how different hyperparameter settings exploit the pre-trained models to generate and use latent representations of data across vision, text, or combined modalities. This method isolates performance variations to hyperparameter effects, ensuring clear attribution in our findings. Due to the complexity of the configuration space, this study does not examine the parameter manifold of pre-trained models, focusing instead on distinguishing between models that can still improve and those that cannot. Future work will focus on expanding the framework’s application in various settings, including the sampling of parameters from parameter spaces, and further refining its capabilities to meet the evolving demands of AutoML solutions.