1 Introduction

The evolution of Automated Machine Learning (AutoML) has been pivotal in addressing the complexity of developing specialized, end-to-end machine learning pipelines. However, existing AutoML frameworks often face a critical challenge: efficiently and accurately handling multimodal data (Liu et al., 2021; Wistuba et al., 2019). The lack of generalised frameworks for multi-modality processing (Baevski et al., 2022) and the dearth of systematic comparisons between information fusion techniques (Liang et al., 2021) remain the main hurdles for multimodal AutoML. Multimodal Neural Architecture Search (NAS) is notably resource-intensive, creating a barrier to efficient pipeline development (Elsken et al., 2018; Liu et al., 2018). This paper addresses this challenge by proposing a novel approach that leverages the power of pre-trained Transformer models, renowned for their effectiveness in domains such as Natural Language Processing and Computer Vision (Du et al., 2022; Öztürk et al., 2022; Qiu et al., 2020).

Our approach aims to minimize the reliance on NAS for processing multimodal data. By integrating pre-trained Transformer models, we efficiently bridge the gap between different data modalities, transferring knowledge and reducing the computational overhead commonly associated with NAS. Furthermore, we enhance this integration with a warm-started Bayesian Optimization technique, an intelligent mechanism that guides the search for optimal pipeline configurations. It leverages historical data and prior experience, akin to the adaptability observed in human cognitive processes (Van Ackeren et al., 2018). By initiating the search from a promising region within the configuration space, we substantially lower computational demands, providing a practical and effective solution for multimodal AutoML tasks.

This research presents a comprehensive solution to the dual challenges of multimodal data processing in AutoML: reducing the dependency on expensive NAS methods (Wistuba et al., 2019) and effectively integrating pre-trained Transformer models (Liu et al., 2021; Zöller & Huber, 2019). Our approach not only streamlines the development of multimodal ML pipelines but also ensures their adaptability and efficiency, representing an advancement in AutoML for handling complex data modalities such as tabular-text, text-vision, and vision-text-tabular configurations. The major contributions of this paper include the design of a versatile search space (pipeline) for multimodal data, the strategic incorporation of pre-trained models within the pipeline architectures, and the implementation of warm-starting for SMAC using metadata derived from prior evaluations. This novel methodology underscores our commitment to enhancing AutoML’s capability to navigate and optimize multimodal data processing efficiently.

2 Related works

2.1 Automated machine learning (AutoML)

Automated Machine Learning has become a crucial component in data-driven decision-making, enabling domain experts to utilize machine learning without needing extensive statistical expertise. AutoML primarily revolves around the Combined Algorithm Selection and Hyperparameter Optimization (CASH) concept, focusing on scalability and computational efficiency (Zöller & Huber, 2019). While frameworks like the Tree-Based Pipeline Optimization Tool (TPOT) have showcased the potential for sophisticated pipelines in AutoML (Olson & Moore, 2016), challenges persist, especially in Neural Architecture Search (NAS) and hyperparameter optimization.

NAS, essential for customizing architectures, struggles with computational demands and optimization complexities (Wistuba et al., 2019). Meanwhile, advancements in Bayesian optimization have enhanced hyperparameter space navigation (Zöller & Huber, 2019). However, an integrated framework harnessing both NAS and hyperparameter tuning remains elusive. This research responds to this need by proposing a unified AutoML framework that synergizes NAS with hyperparameter optimization for efficient pipeline creation and tuning. A key focus of our approach is the concept of warm-starting, drawing parallels with human cognitive processes to efficiently adapt to new tasks through meta-learning (Barrett, 2017; Vanschoren, 2020). This involves using meta-features and prior evaluations to guide the configuration search, employing a meta-learner to predict performance and expedite the convergence to optimal solutions (Hospedales et al., 2020; Nguyen et al., 2014). Such an approach not only speeds up the optimization process but also imbues AutoML systems with cognitive-like adaptability, enhancing their capability to handle diverse and novel tasks.

2.2 AutoML and pre-trained transformer models

In the AutoML domain, CLIP’s integration in AutoGluon marks a pivotal shift towards multimodal learning (Erickson et al., 2020; Radford et al., 2021). Despite its innovative approach to image-text pairings, CLIP’s limitations in handling text-only or image-only data highlight a need for more versatile models (Radford et al., 2021). Transformers, known for their success in NLP tasks (Devlin et al., 2018; Lan et al., 2019; Liu et al., 2019), and Vision Transformers (ViT) for vision tasks (Dosovitskiy et al., 2020), offer a promising solution with their efficient handling of long input sequences.

Pre-trained vision-language transformer models in AutoML can significantly enhance automatic pipeline synthesis by adeptly handling multimodal data (Du et al., 2022). Recent developments in transformer architectures have produced single- and dual-stream models, each with unique strengths in processing multimodal inputs. Single-stream models like OSCAR, VisBERT, and VLBERT offer a unified approach but face challenges with intra-modal interactions (Li et al., 2020, 2019; Su et al., 2019), while dual-stream models like LXMERT and Albef excel in cross-modal attention (Li et al., 2021; Tan & Bansal, 2019). Our research explores the impact of various transformer model architectures and pre-training objectives on AutoML systems. We focus on models such as FLAVA, a dual-stream architecture generating cross-modal embeddings (Singh et al., 2021), Albef, which aligns modalities before feeding them to a multimodal Transformer (Li et al., 2021), and Data2Vec, a modality-agnostic model (Baevski et al., 2022).

2.3 Configuration space \(\Theta\)

The construction of a structured configuration space remains central to our inquiry into integrating pre-trained Transformer models within AutoML systems. The configuration space of an AutoML system is an integral part of generating automated pipelines: it is the space that a search algorithm explores to find the specific elements of a machine learning pipeline (Hutter et al., 2019). This space is structured and parameterized to confine the search (Hutter et al., 2019; Öztürk et al., 2022). Current AutoML systems like AutoWEKA (Thornton et al., 2012), AutoGluon (Erickson et al., 2020) and AutoSklearn (Feurer et al., 2020) construct these spaces hierarchically to enable a guided search strategy. Given n hyperparameters (continuous or categorical) \(\lambda _{1}, \lambda _{2} \cdots \lambda _{n}\) with domains \(\Lambda _{1}, \Lambda _{2} \cdots \Lambda _{n}\), the configuration space \(\Theta\) is a subset of the cross-product of these domains: \(\Theta \subset \Lambda _{1} \times \cdots \times \Lambda _{n} \cup \lambda _{r}\), where \(\lambda _{r}\) is a root-level hyperparameter. This subset is strict, such as when certain settings of one hyperparameter render other hyperparameters inactive, inducing a hierarchical structure within the configuration space (Thornton et al., 2012). This hierarchical structure is critical because it prevents the sampling of incompatible hyperparameters. Incompatibility, or negative transfer, can occur in two ways: the sampled hyperparameters may not align with the sampled pre-trained model, or the sampled pre-trained model (itself a hyperparameter) may not align with the task (\(\lambda _{r}\)). More formally, following Thornton et al. (2012), a hyperparameter \(\lambda _{i}\) is conditional on another hyperparameter \(\lambda _{j}\) if \(\lambda _{i}\) is only active when \(\lambda _{j}\) takes values from a given set \(\mathcal {V}_{i}(j) \subsetneq \Lambda _{j}\); in this case, we call \(\lambda _{j}\) a parent of \(\lambda _{i}\). Conditional hyperparameters can in turn be parents of other conditional hyperparameters, giving rise to a tree-structured space (Thornton et al., 2012).
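To make this conditionality concrete, the following minimal sketch builds a small tree-structured space with the ConfigSpace library (the package SMAC builds on). The hyperparameter names, value ranges, and the pre-1.0 `add_hyperparameters`/`add_condition` API used here are illustrative assumptions, not the paper's exact space from Table 2.

```python
# Minimal sketch of a tree-structured configuration space using the
# ConfigSpace library (the package SMAC builds on). Hyperparameter names
# and ranges are illustrative, not the paper's exact space.
from ConfigSpace import ConfigurationSpace
from ConfigSpace.conditions import EqualsCondition
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter,
    UniformFloatHyperparameter,
)

cs = ConfigurationSpace(seed=0)

# Root-level hyperparameter lambda_r: the task determines the active subtree.
task = CategoricalHyperparameter("task", ["tabular_text_clf", "vqa"])

# The pre-trained model is itself a hyperparameter; only vision-language
# models are valid for VQA, so this choice is a child of the root.
vqa_model = CategoricalHyperparameter("vqa_model", ["flava", "albef"])

# A numerical hyperparameter that exists only for FLAVA-based pipelines,
# making vqa_model in turn the parent of another conditional hyperparameter.
flava_dropout = UniformFloatHyperparameter(
    "flava_dropout", lower=0.0, upper=0.5, default_value=0.1
)

cs.add_hyperparameters([task, vqa_model, flava_dropout])
cs.add_condition(EqualsCondition(vqa_model, task, "vqa"))            # V(task) = {vqa}
cs.add_condition(EqualsCondition(flava_dropout, vqa_model, "flava"))

# Sampling respects the hierarchy: inactive hyperparameters are never drawn,
# which is what prevents incompatible (negative-transfer) pipelines.
print(cs.sample_configuration())
```

Because inactive hyperparameters are never sampled, the search algorithm only ever sees configurations that are valid for the chosen value of \(\lambda _{r}\).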

Any search strategy selected to explore the structured hierarchical configuration space should handle the exploration-exploitation trade-off, i.e., finding well-performing algorithms while avoiding premature convergence to a region of sub-optimal algorithms (Elsken et al., 2018). According to Vanschoren (Hospedales et al., 2020), a configuration space \(\Theta ^*\) (continuous, categorical or mixed) consisting of hyperparameter settings, pipeline components and/or network architecture components can be learned via meta-learning from evaluations of algorithms on a set of prior evaluations P.

2.4 Multimodal learning and fusion techniques

In light of the developments in Automated Machine Learning (AutoML), particularly in Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO), this section focuses on the fusion strategy for integrating multimodal information. The motivation stems from the challenges in multimodal representation learning, as detailed by Liang et al. through MULTIBENCH, a comprehensive benchmark for multimodal learning spanning diverse datasets and modalities (Liang et al., 2021). Among various fusion paradigms like Early Fusion, Multiplicative Interactions, and Temporal Attention Models, Late Fusion (LF) emerges as notably effective. LF demonstrates a favorable balance between performance and robustness, outperforming more complex methods like MFAS or MuLT (Liang et al., 2021).

Furthermore, LF’s adaptability across various domains is highlighted by Shi et al. (2021), who demonstrate its efficacy in synthesizing end-to-end ML pipelines for combined tabular and text modalities (Erickson et al., 2022). They also show LF’s high accuracy in different AutoML strategies. Based on this empirical evidence and the need for robust, adaptable fusion methods in multimodal learning, Late Fusion is selected as the fusion strategy in our pipeline architecture. This choice aligns with the current research trends and promises enhanced performance and adaptability in handling multimodal datasets within our AutoML framework.

3 Problem formulation

In this section, we formally describe the problem studied in the scope of this work. To aid the formalism, a list of all mathematical notations can be found in Table 1.

Table 1 Mathematical Notations and Their Representations

3.1 Combined algorithm selection and hyperparameter optimization (CASH)

Our objective for incorporating pre-trained (Transformer) models in AutoML systems is to enable AutoML over unimodal as well as multimodal data. For that, we formulate a Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem. CASH entails selecting optimal learning algorithms from a set \(\mathcal {A}\), which includes pre-trained deep models (\(A_{ptm}^{(i)}\)) and classical machine learning models (\(A_{m}^{(i)}\)), and fine-tuning their hyperparameters (\(\lambda\)) for peak performance.

Formally:

  • Algorithm set \(\mathcal {A}\): Comprises \(m\) pre-trained models (\(A_{ptm}^{(1)}, \ldots , A_{ptm}^{(m)}\)) and \(n\) classical ML models (\(A_{m}^{(1)}, \ldots , A_{m}^{(n)}\)).

  • Hyperparameter space \(\Lambda\): A subset of the Cartesian product of individual hyperparameter domains, \(\Lambda \subset \prod _{i=1}^{m} \Lambda _{ptm}^{(i)} \times \prod _{j=1}^{n} \Lambda _{m}^{(j)}\).

  • Pipeline structures set \(G\): All valid combinations of one pre-trained model and one classical ML model. The abstraction of a valid pipeline structure \(g'\) can be seen in Fig. 1.

Fig. 1 Overview of the pipeline structure \(g'\) whose components need to be generated and whose hyperparameters need to be optimised in a combined fashion. \(A_{ptm}\) is the selected pre-trained model, which generates the embedding \(\hat{e}\); this embedding is fed to a classical ML model \(A_{m}\) that maps \(\hat{e}\) to the target domain \(\mathbb {Y}\)

Embeddings (\(\hat{e}\)) from the pre-trained model lie in a high-dimensional space \(\mathbb {R}^{E}\) and are inputs to classical ML models for mapping to the target domain \(\mathbb {Y}\). The goal is to identify the optimal pipeline configuration \(g'\), which comprises a pre-trained model (\(A_{ptm}\)) and a classical ML model (\(A_{m}\)), each with its tuned hyperparameters (\(\lambda _{i}^{*}\) and \(\lambda _{j}^{*}\), \(i \ne j\)). The true performance of a pipeline configuration is given by:

$$\begin{aligned} \hat{R} \left( \mathcal {P}_{g',\hat{A^{*}},\hat{\lambda ^{*}},P}, P \right) = {\mathbb{E}}[\mathcal {L}(h(\mathbb {X}), \mathbb {Y})] = \int \mathcal {L}(h(\mathbb {X}),\mathbb {Y}) \, dP(\mathbb {X,Y}) \end{aligned}$$
(1)

Since the true distribution \(P(X,Y)\) is unknown, we approximate the performance with a dataset \(D\):

$$\begin{aligned} \hat{R} \left( \mathcal {P}_{g',\hat{A^{*}},\hat{\lambda ^{*}}, D}, D \right) = \frac{1}{m} \sum _{i=1}^{m} \mathcal {L}(h(x_{i}), y_{i}) \end{aligned}$$
(2)

where \(\mathcal {L}\) is the loss function and \(h(\cdot)\) is the hypothesis used to approximate the true process. The objective is to minimize this estimated performance over \(k\)-fold cross-validation:

$$\begin{aligned} \left( g',\hat{A^{*}}, \hat{\lambda ^{*}} \right) ^{*} = {\mathop {\mathrm{arg\,min}}\limits _{\hat{A^{*}} \in \mathcal {A}^{|g'|}, g' \in G, \hat{\lambda ^{*}} \in \Lambda }}\, \frac{1}{k} \sum _{i=1}^{k} \hat{R} \left( \mathcal {P}_{g',\hat{A^{*}},\hat{\lambda ^{*}},D^{(i)}_{train}}, D^{(i)}_{valid} \right) \end{aligned}$$
(3)

Optimization occurs across a hierarchical hyperparameter space, including a root-level hyperparameter (\(\lambda _{r}\)) for algorithm and pipeline structure selection. It is important to note that our approach does not assume that all pre-trained models will align with the specific learning tasks. Instead, we manage this issue through a structured hierarchical configuration space where the selection of a pre-trained model is analogous to choosing a hyperparameter. This configuration space is designed to activate certain pre-trained models only if the task-specific root-level hyperparameter \(\lambda _{r}\) permits their inclusion. This method ensures that only relevant and task-appropriate models are considered, thereby minimizing the risk of negative transfer.
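As a minimal illustration of the inner evaluation in Eq. (3), the sketch below scores one candidate pipeline \((A_{ptm}, A_{m})\) by k-fold cross-validation. Everything here is synthetic: the `embed_with_ptm` helper is a hypothetical stand-in for a frozen pre-trained model producing embeddings \(\hat{e} \in \mathbb {R}^{E}\), implemented as a random projection so the snippet runs self-contained.

```python
# Sketch of the inner evaluation in Eq. (3): a pipeline g' = (A_ptm, A_m)
# is scored by k-fold cross-validation. `embed_with_ptm` stands in for a
# frozen pre-trained model; in the paper this would be FLAVA / Albef /
# Data2Vec embeddings, not a random projection.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def embed_with_ptm(X_raw: np.ndarray, dim: int = 64) -> np.ndarray:
    """Placeholder for a frozen A_ptm mapping raw inputs to R^E."""
    W = rng.standard_normal((X_raw.shape[1], dim))
    return np.tanh(X_raw @ W)

# Toy "raw" data standing in for a multimodal dataset D.
X_raw = rng.standard_normal((200, 32))
y = rng.integers(0, 2, size=200)

e_hat = embed_with_ptm(X_raw)                   # A_ptm: raw inputs -> R^E
clf = LogisticRegression(max_iter=1000)         # A_m with hyperparameters lambda
scores = cross_val_score(clf, e_hat, y, cv=5)   # (1/k) * sum_i R_hat(...)
print("estimated risk (1 - mean CV accuracy):", 1.0 - scores.mean())
```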

3.2 The problem of warm-starting

In the context of AutoML's CASH problem, warm-starting is the process of initiating the optimization with informed configurations derived from prior knowledge, as opposed to the random initial configurations of cold-starting. We define warm-starting as leveraging historical data or results from related tasks to jump-start the current optimization task, aiming to reduce the time and computational resources required to reach an optimal solution. This definition rests on several assumptions. First, sufficient meta-data is available for initial configuration selection. Second, warm-starting presupposes a balance between exploiting known effective configurations and exploring new regions of a vast, complex configuration space, so as to avoid local optima. Finally, it operates on the premise that beginning the optimization process near potentially optimal solutions incrementally enhances efficiency and effectiveness, favoring steady progress over random breakthroughs.

To facilitate the warm-starting process, we introduce a meta-dataset M, illustrated in Fig. 2, formalized as follows:

  • The meta-dataset M encompasses scalar performance metrics \(P_{j,i}\), each derived from evaluating the ith configuration of an ML pipeline on a task \(t_{j}\) from a set of tasks T. Each configuration integrates a pre-trained model \(A_{ptm}^{(m)}\) and a traditional ML model \(A_{m}^{(n)}\), alongside their aggregate hyperparameters \(\lambda _{j,i} \in \Lambda\), within a pipeline structure \(g'\). The meta-dataset M thus records \(P_{j,i}\) as real-valued targets, with the categorical and numerical hyperparameters \(\lambda _{j,i} \in \Lambda\), \(\lambda _{j,i} \in \mathbb {R}^{k}\), serving as the features for task \(t_{j}\).

Utilizing a meta-learner \(\psi _{L}\) (\(\psi _{L}:\Lambda \mapsto \mathbb {R}\)), we project performance indicators \((\mu _{j,i}, \sigma _{j,i})\) for these configurations. Furthermore, by segregating M into distinct sets for training (\(M^{\text {train}}\)) and evaluation (\(M^{\text {eval}}\)), we employ an acquisition function \(a_{\mathcal {M}l}\) to identify an initial configuration \(\lambda _{w}\) that strikes a balance between predicted efficacy and potential for further exploration, formalized as follows:

$$\begin{aligned} \lambda _{w} \in {\mathop {\mathrm{arg\,max}}\limits _{\lambda _{w}, m(j',i') \in M^{eval}}}\, \sum _{i=1}^{K} \frac{1}{K} a_{\mathcal {M}l} \left( \psi _{L, M^{train}} \left( m \left( j',i' \right) \right) \right) \end{aligned}$$
(4)
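A hedged sketch of this selection step follows: a RandomForest regressor stands in for \(\psi _{L}\), the spread of its per-tree predictions provides the \((\mu , \sigma )\) estimates, and Expected Improvement plays the role of \(a_{\mathcal {M}l}\). The meta-dataset here is synthetic, and the encoding of configurations into fixed-length feature vectors is an assumption for illustration.

```python
# Sketch of the warm-start selection in Eq. (4): a RandomForest meta-learner
# psi_L is fit on M_train (features = encoded configurations lambda, target
# = performance P), then Expected Improvement is evaluated over M_eval to
# pick the initial configuration lambda_w. All data here is synthetic.
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.random((300, 10))     # encoded configurations in M_train
y_train = rng.random(300)           # scalar performances P_{j,i}
X_eval = rng.random((50, 10))       # candidate configurations in M_eval

psi_L = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Per-tree predictions give a cheap mean/uncertainty estimate (mu, sigma).
tree_preds = np.stack([t.predict(X_eval) for t in psi_L.estimators_])
mu, sigma = tree_preds.mean(axis=0), tree_preds.std(axis=0) + 1e-9

best = y_train.max()                # best performance seen so far
z = (mu - best) / sigma
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # Expected Improvement

lambda_w = X_eval[np.argmax(ei)]    # warm-start configuration
print("index of lambda_w:", int(np.argmax(ei)))
```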

The ultimate goal in a warm-started Bayesian Optimization (BO) framework is to ascertain the optimal machine learning pipeline \(\mathcal {P}_{g', \lambda ^{*}}\). This pipeline is composed of a combination of pre-trained and classical models denoted by \(\lambda ^{*}\), configured within a valid structural form \(g'\), where \(g' \in G\). The optimization seeks to minimize the empirical risk \(\hat{\mathcal {R}}\) over the pipeline configurations and structure:

$$\begin{aligned} \left( g', \lambda ^{*} \right) ^{*} \in {\mathop {\mathrm{arg\,min}}\limits _{\lambda ^{*} \in \Lambda , g' \in G}}\, \frac{1}{k} \sum _{i=1}^{k} \hat{\mathcal {R}} \left( \mathcal {P}_{g',\lambda ^{*},D^{(i)}_{train}}, D^{(i)}_{valid} \right) \end{aligned}$$
(5)
Fig. 2 Diagrammatic representation of the meta-learner \(\psi _{L}\) facilitated process for selecting the initial configuration \(\lambda _{w}\) to initiate the BO search

This is complemented by the meta-learning strategy, where a meta-learner \(\psi _{L}\) forecasts the mean and standard deviation of the performance for given configurations. The acquisition function is subsequently optimized across the evaluation set to select an initial configuration \(\lambda _{w}\) that maximizes the expected performance. This initial configuration \(\lambda _{w}\) is then employed to commence the BO search procedure. Figure 2 provides an abstract overview of the formulated process.

3.3 Configuration space (\(\Theta\)) complexity

The Configuration Space, denoted as \(\Theta\), is delineated as a complex hybrid space composed of learning algorithms \(\mathcal {A}\), the hyperparameter space \(\Lambda\), and the pipeline structures \(G\). Elements within \(\Theta\) are categorized as either numerical or categorical. The diversity, \(D_{\text {numerical}}\), across \(n\) numerical hyperparameters in \(\Theta\), where \(H_{i,\text {max}}\) and \(H_{i,\text {min}}\) signify the maximum and minimum allowable values for the \(i\)th numerical parameter, is defined as:

$$\begin{aligned} D_{\text {numerical}} = \prod _{i=1}^{n}(H_{i,\text {max}} - H_{i,\text {min}}) \end{aligned}$$
(6)

For hyperparameters scaled logarithmically (e.g., weight decay and layer normalization \(\epsilon\)), the term \((H_{i,\text {max}} - H_{i,\text {min}})\) is substituted by \(\log (\frac{H_{i,\text {max}}}{H_{i, \text {min}}})\). Conversely, the diversity for \(m\) categorical hyperparameters, \(D_{\text {categorical}}\), within \(\Theta\) is the product of available categories for each hyperparameter:

$$\begin{aligned} D_{\text {categorical}} = \prod _{j=1}^{m}C_{j} \end{aligned}$$
(7)

where \(C_{j}\) represents the number of categories for the \(j\)th categorical hyperparameter. The overall complexity of \(\Theta\) is thus a function of both numerical and categorical diversity, alongside the aggregate possible configurations (\(T_{c}\)), formulated as:

$$\begin{aligned} T_{c} = D_{\text {numerical}} \times D_{\text {categorical}} \end{aligned}$$
(8)

Incorporating values from Table 2:

  • For categorical hyperparameters: Pretraining Model (8 options), Pretraining Processors (3 options), Downstream model (12 options), and Downstream processor (1 option).

  • For numerical hyperparameters: Totaling 8, with their respective ranges considered for computation.

  • The \(\Theta\) space encompasses two subsets related to pre-trained models \(\Theta _{ptm}\) and Neural Architecture Search (NAS) algorithms \(\Theta _{NAS}\).

Hence, the complexity \(T_{c}\) is determined as:

$$\begin{aligned} |T_{c}| = T_{c,\Theta _{ptm}} + T_{c,\Theta _{NAS}} = 9.38 \times 10^{14} \end{aligned}$$
(9)

with \(T_{c,\Theta _{ptm}}\) and \(T_{c,\Theta _{NAS}}\) indicating the complexities of the spaces comprising hyperparameters associated with pre-trained models and NAS algorithms, respectively. Consequently, the estimated complexity of our constructed configuration space \(\Theta\) approximates 938 trillion possible configurations, emphasizing the computational challenge of optimizing within this extensive space.
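The back-of-the-envelope computation below reproduces Eqs. (6)-(8). The categorical counts come from the bullets above; the numerical ranges are illustrative placeholders, since the exact bounds live in Table 2.

```python
# Illustrative reproduction of Eqs. (6)-(8) for the configuration space
# complexity. Numerical ranges below are stand-ins; the paper's exact
# values live in Table 2.
import math

# Categorical diversity (Eq. 7): counts taken from the bullets above
# (pretraining model, pretraining processors, downstream model/processor).
categorical_counts = [8, 3, 12, 1]
D_categorical = math.prod(categorical_counts)

# Numerical diversity (Eq. 6): (max - min) per linear-scale hyperparameter,
# log(max / min) for log-scale ones such as weight decay and layer-norm eps.
linear_ranges = [(0.0, 0.5), (1, 512)]        # e.g. dropout, batch size
log_ranges = [(1e-6, 1e-2), (1e-12, 1e-5)]    # e.g. weight decay, eps
D_numerical = math.prod(hi - lo for lo, hi in linear_ranges)
D_numerical *= math.prod(math.log(hi / lo) for lo, hi in log_ranges)

T_c = D_numerical * D_categorical             # Eq. (8)
print(f"D_categorical={D_categorical}, D_numerical={D_numerical:.3g}, T_c={T_c:.3g}")
```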

4 Pretrained transformer-based AutoML (PTA) methodology

Fig. 3 Detailed overview of the Pretrained Transformer-based AutoML (PTA) system for warm-starting AutoML over multimodal data. LF denotes Late-Fusion

4.1 Overall methodology

Our research integrates pre-trained deep neural models into Automated Machine Learning (AutoML) systems to efficiently process multimodal data. Our methodology comprises four sequential steps: Prior Evaluation and Meta-Dataset Construction; Configuration Space (\(\Theta\)) Construction; SMAC Setup and Execution; and Evaluating the Optimisation. The rest of this section explains each of these steps in detail. Figure 3 describes the overall workflow of our proposed Pre-Trained Transformer Based AutoML (PTA) framework. The meta-dataset M is constructed through extensive evaluations, conducted prior to the SMAC optimisation, across diverse pipeline configurations, tasks, and datasets following some pipeline structure \(g'\). Moreover, \(\Theta\), represented in the diagram, is the search space constructed after building the meta-dataset M. The encoder, meta-model (\(\psi _{L}\)), acquisition function, decoder, objective function, intensifier, and configuration selector are all components within SMAC. We examine the interactions of these components in detail in Sect. 4.4.

4.2 Prior evaluations and meta-dataset construction

4.2.1 Prior evaluations: pipeline variants of multimodal AutoML

In this step, we evaluate task-specific variants of multimodal pipeline architectures constructed in a specific pipeline structure \(g'\) across datasets belonging to the tabular-text, text-vision and tabular-text-vision modalities. Furthermore, we evaluate these pipeline variants across tasks such as classification, regression, Image Text Matching (ITM), and Visual Question Answering (VQA). Our framework includes three specific pipeline variants, each designed for a specific modality-task combination.


Pipeline Variant 1

  • Tabular-Text Modality This variant, focusing on classification and regression tasks, utilizes FLAVA and Data2Vec models for text data processing and a NAS-derived MLP (multimodal-net) for tabular data. The late fusion strategy combines these embeddings, which are then processed by AutoGluon’s Tabular Predictor. The pipeline architecture, as shown in Fig. 4a, is optimized for handling both binary (multi-label) classification and regression tasks. The AutoGluon Tabular Predictor explores various tabular architectures, including ensemble tree-based models, and records model performance to inform future meta-learning.

  • Tabular-Text-Vision Modality For this modality, the pipeline employs FLAVA and Albef models to encode image-text data, and the multimodal-net for tabular data. The encoded data from all three modalities is mapped into a unified latent space, implementing late fusion and downsampling using translation invariant methods like MaxPool. The combined embeddings are then processed by AutoGluon’s Tabular Predictor for task-specific execution, as depicted in Fig. 4b. This variant aims to record pipeline performance over multimodal datasets, enriching the meta-dataset for future optimization strategies.
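A minimal late-fusion sketch in PyTorch is given below: per-modality embeddings are projected to a shared latent width, stacked, and downsampled with a translation-invariant max-pool across modalities before the fused vector is handed to a downstream (tabular) predictor. The embedding dimensions and projection layers are illustrative assumptions, not the pipeline's actual sizes.

```python
# Minimal late-fusion sketch: each modality is encoded independently,
# projected to a shared latent width, then combined by a max-pool across
# modalities (a translation-invariant downsampling). Dimensions are
# illustrative, not the pipeline's actual sizes.
import torch
import torch.nn as nn

latent = 256
proj_text = nn.Linear(768, latent)    # e.g. FLAVA text embedding -> latent
proj_img = nn.Linear(1024, latent)    # e.g. Albef image embedding -> latent
proj_tab = nn.Linear(64, latent)      # e.g. multimodal-net tabular features

e_text = torch.randn(8, 768)          # batch of 8 text embeddings
e_img = torch.randn(8, 1024)
e_tab = torch.randn(8, 64)

# Late fusion: stack the projected embeddings along a "modality" axis.
stacked = torch.stack(
    [proj_text(e_text), proj_img(e_img), proj_tab(e_tab)], dim=1
)                                      # (batch, 3 modalities, latent)
fused, _ = stacked.max(dim=1)          # MaxPool across modalities -> (batch, latent)
print(fused.shape)                     # torch.Size([8, 256])
```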

Fig. 4 Architectural Designs of the Pipeline Variant 1 for the Tabular-Text and Tabular-Text-Vision Modalities


Pipeline Variant 2

Focused on the ITM task, this variant processes unlabelled image-text data using FLAVA and Albef models. Data from datasets like Flickr30k and SBU image captioning is batch-processed and encoded into unified latent embeddings. A regression task maps these embeddings to contrastive scores obtained from the pre-trained models, optimizing the mapping using AutoGluon Tabular, as shown in Fig. 5a. This approach aims to assess model performance and optimize pipeline configurations for the ITM task.


Pipeline Variant 3

Designed for the VQA task (Fig. 5b), this variant focuses on labeled vision-language data from the VQA2.0 dataset. FLAVA and Albef models are used for encoding images and questions. Batch processing is conducted via a PyTorch class object, with the encoded data forming a unified latent space. The combined embeddings and answer targets create a new dataset, processed by AutoGluon Tabular for the VQA task. The focus is on deriving insights from the visual-textual interplay and optimizing model selection for the VQA task. We record the performance evaluations (AUC scores) of the multi-label classification task conducted by AutoGluon’s Tabular Predictor using different tree-based ensemble models. This prior knowledge is incorporated into the meta-dataset M.

Fig. 5 Architectural Designs of the Pipeline Variants 2 and 3, respectively

Each variant demonstrates a unique approach to multimodal data processing, leveraging the capabilities of pre-trained models and the exploratory power of AutoGluon Tabular.

4.2.2 Meta-dataset construction

After designing the three pipeline variants above, a meta-dataset M is constructed by recording the scalar performances \(P_{j,i}\) corresponding to each variant, across various tasks \(t_{j}\) selected from the set comprising classification, regression, ITM and VQA tasks. Given the ith pipeline configuration \(\lambda _{j,i}\) (pre-processing, pre-trained, and traditional ML algorithm names as well as hyperparameters) for the jth task, M records \(\lambda _{j,i}\) and the corresponding \(P_{j,i}\). M is realized as a nested Python dictionary, whose keys are the hyperparameter or algorithm names and whose values are the recorded experimental values (numerical or categorical). M also records the names of the pre-trained models along with the names of the traditional ML models used in the pipeline as strings. The list of hyperparameters corresponding to the different pre-trained and traditional ML models included within M can be found in Table 2.
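The snippet below shows the assumed shape of one such entry of M; all keys, names, and values are hypothetical examples of the kind of record described above, not actual experimental results.

```python
# Illustrative shape of one meta-dataset entry; keys and values are
# hypothetical examples of what is recorded, not actual results.
M = {
    "task_petfinder_classification": {
        "config_0": {
            "pretrained_model": "flava",        # A_ptm, stored as a string
            "downstream_model": "LightGBM",     # A_m, stored as a string
            "ptm_layer_norm_eps": 1e-12,        # numerical hyperparameters
            "ptm_hidden_dropout": 0.1,
            "downstream_num_leaves": 64,
            "performance": 0.74,                # scalar target P_{j,i}
        },
    },
}

# Flattening M yields the (lambda_{j,i} -> P_{j,i}) pairs on which the
# meta-learner psi_L is trained.
for task, runs in M.items():
    for run_id, record in runs.items():
        features = {k: v for k, v in record.items() if k != "performance"}
        print(task, run_id, features, "->", record["performance"])
```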

4.3 Construction of configuration space \(\Theta\)

The construction of the configuration space \(\Theta\) is crucial for synthesizing effective machine-learning pipelines in our AutoML system. \(\Theta\) serves as a search space for the Sequential Model-Based Optimization (SMBO) algorithm, containing various components such as pre-trained models, feature processors, and classical ML models, structured hierarchically. This hybrid space includes both the categorical and numerical hyperparameters described in Table 2. The inclusion of pre-trained models like FLAVA, Albef, and Data2Vec in \(\Theta\) is motivated by their distinct capabilities and performance metrics (Du et al., 2022; Khan et al., 2021). FLAVA, for instance, excels in various multimodal tasks and language benchmarks, making it a valuable inclusion for its broad applicability. Data2Vec’s modality-agnostic nature is pivotal for generating universal representations, while Albef adds diversity with its specific strengths. In addition to these models, \(\Theta\) encompasses conventional tree-based models, classical ML classification and regression models, and various preprocessing algorithms. This configuration space is designed as a hybrid of categorical choices (such as selecting specific pre-trained models) and numerical hyperparameters (like layer normalization epsilon and dropout probabilities). The space is conditioned on the task type (\(\lambda _{r}\)), with hyperparameters rendered inactive when irrelevant to the task at hand, ensuring an efficient and targeted search during optimization (refer to Sects. 2.3 and 3.1).

Table 2 Selected hyperparameters and their corresponding ranges in \(\Theta\)

4.4 SMAC setup and execution: warm starting PTA

Upon the assembly of the hybrid configuration space \(\Theta\), we initialise and warm-start the Sequential Model-Based Algorithm Configuration (SMAC) procedure. This process encompasses the meta-dataset M, meta-model \(\psi _{L}\), acquisition function \(a_{\mathcal {M}l}\), objective function \(f_{\theta }\), intensifier \(\mathcal {I}\), and the configuration selector, aimed at addressing the Combined Algorithm Selection and Hyperparameter Optimization (CASH) challenge. The meta-dataset M, compiled during the preceding evaluation phase, encapsulates data on the pipeline configurations \(\lambda\) alongside their scalar performance evaluations P.

Following the creation of meta-dataset M, we establish the Scenario \(\mathcal {S}\) for optimisation, which delineates the optimisation landscape, specifying iterations, budget, and the exploration bounds within \(\Theta\). The intricacies of our optimisation scenario \(\mathcal {S}\) are elaborated in the experiments and results section. After the configuration of \(\mathcal {S}\), the meta-learner \(\psi _{L}\) is trained on M (Footnote 1) using 3-fold cross-validation. With the modality and the task defined as a root-level hyperparameter \(\lambda _{r}\), the configuration selector samples only the hyperparameters activated under the corresponding subset of the structured configuration space. Furthermore, to ascertain the validity of the pipeline, conditional logic checks whether the sampled pre-trained model (itself a hyperparameter) lies within the permitted zoo of models for the given input task and modality. With these checks, we prevent incompatibility and negative transfer. The configuration selector begins the optimization process by identifying a set of n random initial samples confined within the hierarchy and boundaries of our defined configuration space. The performance of these n configurations is predicted using the trained meta-learner \(\psi _{L}\). The RandomForest meta-learner \(\psi _{L}\) performs a regression task, mapping hyperparameter configurations \(\lambda \in \mathbb {R}^{k}\) to a real value, i.e., \(\psi _{L}: \Lambda \mapsto \mathbb {R}\). Using the mean predictive performance and uncertainty estimates derived from \(\psi _{L}\), we maximize the Expected Improvement (EI) acquisition function \(a_{\mathcal {M}l}\) to pinpoint the initial configuration with the highest potential for optimal performance. The selected initial configuration \(\lambda _{w}\), with the highest EI score, is extracted from the configuration space for actual assessment by the objective function. This function, \(f_{\theta }\), maps the input features \(X \in \mathbb {R}^{d}\) to a real-valued performance metric p, expressed as \(f_{\theta }: \mathbb {R}^{d} \mapsto \mathbb {R}\). At this point, the pipeline integrating the chosen configurations is applied to the input data to fulfil the designated task objective. \(f_{\theta }\) appraises the pipeline’s efficacy for the specific configuration \(\lambda\), including hyperparameters and models, converting these into a scalar metric such as AUC for classification or \(R^2\) for regression tasks. Designed to accommodate multimodal data, \(f_{\theta }\) adjusts its evaluations based on the data modality and the precise task, engaging diverse pre-trained models like FLAVA, Albef, and Data2Vec to ensure its assessments are both accurate and pertinent. Consequently, \(f_{\theta }\) evaluates \(\lambda _{w}\), updating the performance metric in M; \(\lambda _{w}\) is thus acknowledged as the initial incumbent configuration.

To select the subsequent configuration \(\lambda _{w+1}\), \(\psi _{L}\) undergoes re-training with the refreshed M. The configuration selector then extracts m random configurations, applying a 10–12% perturbation rate around the incumbent configuration (\(\lambda _{w}\)), facilitating the exploration of well-performing setups. The configuration with the maximal expected improvement is earmarked for actual evaluation through \(f_{\theta }\). Should this configuration yield a performance metric surpassing that of the incumbent, it is incorporated into the intensifier queue, updating the incumbent and M concurrently. The intensifier \(\mathcal {I}\) adopts an Aggressive Racing Strategy, probing the vicinities of promising incumbent configurations by instructing the configuration object to sample m configurations with the specified perturbation rate around these incumbents. The evaluations of these sampled configurations proceed in parallel, with the evaluation records continually refreshed in M, and \(\psi _{L}\) re-trained prior to each sampling iteration. This cycle repeats until the optimisation budget is exhausted. Figure 3 furnishes a graphical representation of this warm-started SMAC optimisation procedure.
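The sketch below assembles these pieces with SMAC3's 2.x API (assumed here): the warm-start configuration \(\lambda _{w}\) from Eq. (4) is injected through the initial design so that BO starts from a promising region. The toy objective `evaluate_pipeline` stands in for \(f_{\theta }\), and the `additional_configs` mechanism is one way to seed the search; the paper's own warm-starting wiring may differ.

```python
# Hedged sketch of warm-starting SMAC (SMAC3 2.x API assumed): lambda_w is
# injected as an additional initial configuration. `evaluate_pipeline` is a
# toy stand-in for the objective f_theta, not the paper's implementation.
from ConfigSpace import Configuration, ConfigurationSpace, Float
from smac import HyperparameterOptimizationFacade, Scenario

cs = ConfigurationSpace(seed=0)
cs.add_hyperparameters([Float("ptm_dropout", (0.0, 0.5), default=0.1)])

def evaluate_pipeline(config: Configuration, seed: int = 0) -> float:
    # f_theta: build the pipeline for `config`, run it, and return a loss
    # (SMAC minimizes). Here: a toy quadratic standing in for 1 - AUC.
    return (config["ptm_dropout"] - 0.21) ** 2

scenario = Scenario(cs, n_trials=50, walltime_limit=45 * 60)  # 45-min budget

lambda_w = Configuration(cs, values={"ptm_dropout": 0.2})     # from Eq. (4)
initial_design = HyperparameterOptimizationFacade.get_initial_design(
    scenario, n_configs=0, additional_configs=[lambda_w]      # warm start
)

smac = HyperparameterOptimizationFacade(
    scenario, evaluate_pipeline, initial_design=initial_design
)
incumbent = smac.optimize()
print(incumbent)
```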

4.5 Evaluating the optimization: any-time learning metric

In our AutoML framework, evaluating the optimization process post-convergence of the SMAC loop is crucial. To achieve this, we adopt the 'any-time' learning metric, taking inspiration from Liu et al. (2021), emphasizing efficiency under time and data constraints. This is quantified through the Area Under Learning Curve (ALC), expressed as:

$$\begin{aligned} ALC = \frac{1}{\log \left( 1 + \frac{T}{t_{0}} \right) } \int _{0}^{T} \frac{s(t)}{t + t_{0}}dt \end{aligned}$$
(10)

where s(t) is the scoring function. For classification tasks, we use the Normalised Area Under the Curve (NAUC; Footnote 2) as the scoring function; for regression tasks, the \(R^{2}\) score of the model acts as the scoring function in the above equation. The ALC metric captures the learning trajectory over time, especially in the initial phases, reflecting the system’s rapid learning efficiency. By computing the ALC for each dataset, we comprehensively evaluate the AutoML system’s optimization process, focusing on its ability to adapt and learn effectively within limited time frames. This methodological approach thoroughly assesses the framework’s learning behavior and operational efficiency in resource-constrained environments. Additionally, the hierarchical nature of the configuration space ensures the activation of only task-related hyperparameters, rendering task-unrelated hyperparameters, including pre-trained models and their corresponding hyperparameters, inactive. Given this hierarchy, our evaluation function evaluates only scenarios where the pipeline components are entirely compatible with the learning task, given a task and a dataset. The configuration sampler strives to avoid incompatibility, and any hypothetical occurrence would result in consistent crashing and low ALC values. A high ALC value indicates consistent performance and effective sampling of learning algorithms by the AutoML framework, suggesting that the chosen configurations are generally well-suited to the tasks. In case a trial (sampled pipeline configuration) fails to complete its training within the trial budget, we record the performance of such a sample as \(-inf\) or CRASH. Incompatible samples are handled similarly.
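For concreteness, the following is a direct numerical transcription of Eq. (10) from a recorded learning curve. The curve values are made up for illustration, and crashed trials are assumed to contribute \(s(t) = 0\) so the integral stays finite.

```python
# Numerical transcription of the ALC metric in Eq. (10), computed from a
# recorded learning curve. The step-function curve below is illustrative;
# crashed trials contribute s(t) = 0 rather than -inf here.
import numpy as np

def alc(times: np.ndarray, scores: np.ndarray, T: float, t0: float = 60.0) -> float:
    """Area under the learning curve with log-time normalization, Eq. (10)."""
    grid = np.linspace(0.0, T, 10_000)
    # s(t): best score achieved up to time t (step function, 0 before the
    # first successful evaluation).
    s = np.zeros_like(grid)
    running_best = np.maximum.accumulate(scores)
    for t_i, s_i in zip(times, running_best):
        s[grid >= t_i] = s_i
    integrand = s / (grid + t0)
    return np.trapz(integrand, grid) / np.log(1.0 + T / t0)

times = np.array([120.0, 400.0, 900.0, 1800.0])   # seconds within the 45-min run
scores = np.array([0.55, 0.68, 0.66, 0.74])       # NAUC of each evaluated trial
print(f"ALC = {alc(times, scores, T=45 * 60):.3f}")
```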

5 Experiments and results

5.1 Experimental settings


Experiment 1: Prior evaluations of multimodal pipeline architectures

This experiment aims to collect prior insights by evaluating various pipeline configurations across 23 multimodal datasets, documenting the configurations \(\lambda _{j,i}\) and their performance estimates \(P_{j,i}\) in a meta-dataset M. We explore three variants of our proposed multimodal pipeline architectures that employ Late Fusion to integrate multiple modalities for tasks including classification, regression, image-text matching (ITM), and visual question answering (VQA). The experimental setup involves selecting a pre-trained multimodal vision-language transformer to represent vision-language data within a unified latent space. Additionally, we employ Neural Architecture Search (NAS) via AutoGluon to construct a dynamic multilayer perceptron (MLP) for tabular data, aligning it within the same latent space. The Late Fusion process linearizes the final embeddings, which are then utilized by AutoGluon Tabular to predict the respective targets.

For classification and regression tasks, we utilize the 18 AutoMM Benchmark Datasets introduced by Shi et al. (2021). To evaluate classification tasks we use the NAUC metric, and for regression tasks we study the \(R^{2}\) score. The ITM task evaluations are conducted using the Flickr30k (Plummer et al., 2015) and SBU Image Captioning Dataset (Ordonez et al., 2011), while VQA task performance is assessed on the VQA 2.0 Dataset (Agrawal et al., 2015). Additionally, the PetFinder and CD-18 Datasets are employed to evaluate classification and regression tasks within the tabular-text-vision modality. The first variant of our pipeline architecture, as depicted in Fig. 4a, b, is utilized for evaluating tabular-text and tabular-text-vision modalities in classification and regression tasks, respectively. The second pipeline variant, shown in Fig. 5a, is used for ITM tasks, while the third variant addresses the VQA task. Each evaluation session spans \(\approx\) 5–6 h to ensure optimal or near-optimal pipeline configurations are achieved. It is important to note that the pre-trained models’ weights are kept frozen, with only the downstream ML models’ weights being fine-tuned. This implies that the variation in performance observed during the optimization process arises solely from adjustments in the hyperparameters (of the pre-trained models) rather than any changes in the models’ weights. Freezing the pre-trained weights enables us to study the variation in model performance both when the pre-trained models are kept the same and only the hyperparameters are varied, and when we vary the pre-trained models along with their hyperparameters for a given task, modality, and dataset.


Experiment 2: Assessing the efficacy of warm-started PTA

This experiment evaluates the SMAC optimization curve across varied pipeline variants for multimodal tasks, using the ALC metric, which relies on a scoring function \(s(t)\) (Footnote 3) for configuration performance at time \(t\). We focus on assessing our warm-started PTA framework’s optimization quality by studying the ALC and \(s(t)\) scores through the learning curves, identifying promising configurations for further analysis within a 45-min limit, as suggested by Liu et al. (2021). The Scenario \(\mathcal {S}\) is set up with trials capped at 20 min and overall optimization limited to 45 min. With this specific budget, we aim to highlight the efficiency of our warm-started PTA framework in sampling optimal incumbent candidates within a vast search space under a limited budget. Unsuccessful trials are marked as -inf or CRASH, with successful ones \(\gamma \subseteq \Gamma\) being evaluated for consistent exploration of high-performing candidates, represented by high ALC scores (\(\approx 1\)). We address the challenge of inappropriately chosen pre-trained models by including their evaluation in the ALC and marking mismatches or negative transfers with a performance score of -inf, thereby avoiding their impact on the system’s learning curve. This approach ensures that frequent incompatibilities lead to fewer evaluations and lower ALC scores, indicating areas for potential improvement. We then analyze the learning curves across 23 multimodal datasets, plotting the scoring function \(s(t)\) (NAUC for classification and \(R^{2}\) for regression) against log-scaled time (Fig. 6a, b), and compute the Area Under the Learning Curve (ALC) as an any-time learning metric to evaluate the consistency of optimization efforts for all 23 multimodal datasets.

Moreover, we compare the efficacy of our warm-started PTA, leveraging pre-trained models, against an inherently cold-started NAS (multimodal-net) method implemented by Autogluon for handling multiple modalities. The main reason for this comparison is to showcase the efficient exploration of a complex search space under budget constraints by a hybrid pipeline approach (pre-trained transformers + NAS) infused with prior knowledge, as compared to a computationally expensive, cold-started NAS method. Furthermore, we implement Late Fusion (LF) for integrating modalities as suggested by Shi et al. (2021) and Liang et al. (2021), the method performing best across 18 real-world datasets as showcased by Shi et al. (2021) in their experiments on Autogluon. Finally, since the only pre-trained multimodal model implemented by Autogluon is CLIP, apart from its NAS methods for handling multimodal data, we intend to compare Autogluon’s performance against a hybrid AutoML architecture incorporating more sophisticated multimodal models.

For the tabular-text modality, we benchmark against the AutoMM Benchmark by Shi et al. (2021), using a cold-started NAS-based multimodal-net over 5–6 h. Absent a public benchmark for vision-language tasks, we assess the warm-started PTA’s performance (under a 45-min budget constraint) against Autogluon’s NAS-based multimodal-net, evaluating incumbent scores on a similar budget of 45 min. With this comparison, we aim to show that reducing the dependency on NAS for multimodal processing can significantly improve the efficiency of pipeline generation under a constrained budget.

5.1.1 Results: experiment 1

In this section, we report the observed average metrics for the datasets and tasks selected for prior evaluations. For the AutoMM Benchmark datasets, we report the average AUC and \(R^{2}\) scores (\(\mu\) in Table 3) obtained by the FLAVA and Data2Vec pipeline variants across the classification and regression datasets from the AutoMM benchmark, respectively. For the Flickr30k, SBU, VQA, PetFinder and CD-18 datasets, we report the average scores (\(\mu\)) obtained across different pipeline hyperparameters and downstream traditional ML models fitted through Autogluon Tabular, after fixing a pre-trained multimodal model within the multimodal pipeline.

Table 3 FLAVA, Albef, and Data2Vec prior evaluation results across 23 Datasets (18 AutoMM, Flickr30k, SBU, PetFinder, and CD-18) for the 3 modalities, over classification, regression, ITM, and VQA tasks

Tabular + Text modality

FLAVA demonstrates an average AUC score of 0.354 across the classification datasets and an average \(R^{2}\) score of 0.415 across the regression datasets, with a minimal variance of 0.001 and 0.003 respectively, indicating consistent handling of tabular and text data. This performance is 23% and 55% higher than Data2Vec’s average AUC and \(R^{2}\) scores of 0.272 and 0.236 respectively. However, it’s important to note that FLAVA requires a higher average prediction time across both classification and regression datasets compared to Data2Vec. Despite this longer prediction time, FLAVA’s scores are not significantly higher than the ones obtained on the relatively faster Data2Vec pipelines.


Text + Vision modality

For the Visual Question Answering task over the VQA2.0 dataset, Albef and FLAVA give approximately similar performances, with AUC scores of 0.933 and 0.931 respectively, and Albef pipelines performing slightly better on average across different downstream models. The observed difference in average prediction and fit times across Albef and FLAVA pipelines is relatively small, with both pipeline configurations showing consistent performances (\(\sigma ^{2} \approx 0\)) across different downstream ML models. In the ITM task, Albef slightly outperforms FLAVA on the Flickr30k (0.216 vs. 0.205) and SBU datasets (0.181 vs. 0.177). Moreover, Albef pipelines are considerably more efficient than the FLAVA pipelines for prediction on ITM tasks. The average prediction time on the VQA dataset is approximately the same for the pipeline configurations consisting of FLAVA and Albef models.


Tabular + Text + Vision modality

Albef’s performance in the complex multimodal scenarios of the PetFinder and CD-18 datasets (0.393 and 0.424, respectively) edges out FLAVA’s scores (0.373 and 0.412), improvements of roughly 5% and 3%, respectively. Albef also shines in computational efficiency, particularly on the CD-18 dataset, where its average prediction time is 178.171 s, nearly 20% faster than FLAVA’s 215.790 s.


Stability Across Models

The stable performance of FLAVA and Albef, as indicated by low variance scores \(\sigma ^{2}\), highlights their reliability in AutoML strategies. Despite Data2Vec’s modest performance, its modality-agnostic design stands out. Our evaluation compares pipeline performances with pre-trained models such as FLAVA, Albef, and Data2Vec, underpinning the belief that high-quality, higher-dimensional representations are crucial for predictive accuracy. These representations are vital for the success of downstream models, with the optimization and quality of embeddings significantly impacting performance outcomes. The performances of FLAVA, Albef, and Data2Vec across the various modalities offer insights into their distinct strengths, informing future AutoML model selection and strategy enhancements.

5.1.2 Results: experiment 2

Fig. 6 Learning curves depicting the obtained \(R^{2}\) scores for the Flickr30k, SBU (6a) and CD-18 datasets (6b) and the NAUC scores for the PetFinder dataset (6b), as a function of time across different pipeline configurations evaluated during the SMAC optimisation process for the ITM and VQA tasks respectively. x-axis: log(t), y-axis: s(t)

Fig. 7 ALC scores obtained using our warm-started PTA for the 3 studied modalities under a budget T (45 mins). x-axis: datasets, y-axis: ALC scores. Blue: classification data, orange: regression data

Figures 7a–c illustrate the ALC values for the 23 multimodal datasets, highlighting the efficiency and efficacy of our warm-started PTA framework through high ALC scores of 0.762, 0.803, 0.815, and 0.693 for the Petfinder, VQA2.0, Flickr30k, and SBU datasets, respectively, across the tabular-text-vision and text-vision modalities. The learning curves in Fig. 6a, b, representing the Flickr30k, SBU, Petfinder, and CD-18 datasets, corroborate the ALC values, with scores nearing 1 signifying the generation of incumbents with optimal s(t) values throughout the 45-min optimization period.

Fig. 8 Comparison of the NAUC/\(R^{2}\) scores obtained using our PTA framework with AutoGluon for the Tabular-Text, Text-Vision and Tabular-Text-Vision modalities respectively (y-axis: 0 to 1). x-axis: datasets, y-axis: NAUC or \(R^{2}\) (s(t)) scores. Blue: classification data, orange: regression data

For the tabular-text classification task, the salary dataset achieved the highest ALC score of 0.819 (Fig. 7c). The average ALC score for the classification datasets within the AutoMM benchmark is 0.711, as shown in Fig. 7c, indicating substantial performance despite budget limitations. Additionally, an average ALC score of 0.702 was noted for the regression datasets (Fig. 7c), demonstrating our framework’s capability to produce effective pipeline configurations under time constraints. Comparing the incumbents synthesized by our warm-started PTA framework against those from a cold-started NAS method over a similar budget of 45 min, Fig. 8 contrasts the s(t) scores of both approaches for selected datasets. The PTA incumbents for tabular-text classification datasets averaged higher performance (0.7521) compared to the NAS-based multimodal-net by Autogluon (0.530) (Fig. 8a, b), and for tabular-text regression tasks, PTA incumbents also outperformed (0.695 \(R^{2}\)) the NAS-based approach (0.196 \(R^{2}\), Fig. 8a, b). Specifically, for the VQA2.0 dataset, the warm-started PTA framework achieved a NAUC score of approximately 0.820 (Fig. 8c), surpassing the NAS-based approach’s score of approximately 0.433 (Fig. 8d). Similarly, ITM tasks on the Flickr30k and SBU datasets resulted in incumbent \(R^{2}\) scores of 0.912 and 0.810 (Fig. 8c), respectively, demonstrating superior performance over the Autogluon MLP (Fig. 8d). For tabular-text-vision modalities, such as the Petfinder and CD-18 datasets (Fig. 8e, f), the performance of PTA incumbents closely matched the NAS-based incumbents. Based on these results and the additional results in the Appendix (Figs. 9, 10), we infer that it takes Autogluon a considerable amount of resources, both computational and temporal, to construct a complex neural architecture from scratch, especially for non-tabular modalities. The high ALC values obtained by our approach across the 23 datasets underscore that utilising NAS only for the tabular modality, while leveraging prior experience for an informed search across a complex search space, consistently generates well-performing pipeline configurations under a constrained computational budget. Moreover, the combined cross- and intra-modal interactions facilitated by our pre-trained models preserve and capture complex non-linear interactions (\(\hat{e}\)), which turn out to be more informative than the representations constructed by Autogluon’s multimodal-net.

Fig. 9 Warm-Started SMAC Results for the AutoMM Benchmark (Shi et al., 2021)

Fig. 10 Warm-Started SMAC Results for the AutoMM Benchmark (Shi et al., 2021)

6 Conclusion and discussion

This study advances the Automated Machine Learning (AutoML) field, emphasizing multimodal data processing across visual, textual, and tabular inputs. By incorporating pre-trained models, meta-learning, and optimization strategies, we have explored innovative approaches for complex pipeline configurations. Our experiments demonstrate the framework’s rapid convergence to optimal configurations across various modalities, highlighted by its performance on text-vision tasks using datasets like Flickr30k and SBU Image Captioning. The high NAUC scores and Area Under the Learning Curve (ALC) scores across 23 datasets attest to our framework’s efficiency in consistently crafting effective multimodal pipeline architectures within computational constraints. Our comparisons with traditional NAS methods reveal our framework’s superior efficiency, especially in time-limited scenarios, showcasing the strength of warm-starting and partial dependence on NAS, along with potential areas for improvement. The framework’s success in resource-limited settings indicates its potential applicability in real-world scenarios when searching for well-performing architectures, laying a foundation for further research and highlighting the need for broader testing and validation in diverse environments.

Turning to the limitations of our work: in our AutoML framework, we utilize a warm-start approach in which pre-trained models are incorporated with their weights frozen. This ensures that performance variations during optimization arise solely from hyperparameter adjustments, not from changes in the (pre-trained) model weights. Our study focuses on how these hyperparameters impact the efficacy of a static, pre-trained model architecture. We do not fine-tune the pre-trained model weights; instead, we assess how different hyperparameter settings exploit the pre-trained models to generate and use latent representations of data across vision, text, or combined modalities. This method isolates performance variations to hyperparameter effects, ensuring clear attribution in our findings. Due to the complexity of the configuration space, this study does not examine the parameter manifold of pre-trained models, focusing instead on distinguishing between models that can still improve and those that cannot. Future work will focus on expanding the framework’s application in various settings, including the sampling of parameters from parameter spaces, and further refining its capabilities to meet the evolving demands of AutoML solutions.