Is deep learning necessary for simple classification tasks?

Automated machine learning (AutoML) and deep learning (DL) are two cutting-edge paradigms used to solve a myriad of inductive learning tasks. In spite of their successes, little guidance exists for when to choose one approach over the other in the context of specific real-world problems. Furthermore, relatively few tools exist that allow the integration of both AutoML and DL in the same analysis to yield results combining both of their strengths. Here, we seek to address both of these issues by (1) providing a head-to-head comparison of AutoML and DL in the context of binary classification on 6 well-characterized public datasets, and (2) evaluating a new tool for genetic programming-based AutoML that incorporates deep estimators. Our observations suggest that AutoML outperforms simple DL classifiers when trained on similar datasets for binary classification, but that integrating DL into AutoML improves classification performance even further. However, the substantial time needed to train AutoML+DL pipelines will likely outweigh performance advantages in many applications.


Introduction and Background
Deep learning (DL) and automated machine learning (AutoML) are two approaches for constructing high-performance estimators that dramatically outperform traditional machine learning in a variety of scenarios, especially on classification and regression tasks. In spite of their successes, there remains substantial debate, and little quantitative evidence, on the practical advantages of the two approaches and how to determine which will perform best on specific real-world problems.
To address this issue, we conduct a series of experiments to compare the performance of DL and AutoML pipelines on 6 well-characterized binary classification problems. We also introduce and critically assess a new resource that leverages the strengths of both approaches by integrating deep estimators into an existing AutoML tool that constructs machine learning pipelines using evolutionary algorithms. Specifically, we sought to answer two questions in this study:
1. How well does genetic programming-based AutoML perform on simple binary classification tasks in comparison to DL?
2. By augmenting AutoML with DL estimators, do we achieve better performance than with either of the two alone?
Beyond these two questions, we also provide specific recommendations on choosing between DL and AutoML for simple classification tasks, and a set of priorities for future work on the development of methods that integrate DL and AutoML. Our new software resource, which we have named TPOT-NN, is an extension to the previously described Tree-based Pipeline Optimization Tool (TPOT), and is freely available online.

Deep learning and artificial neural networks
Deep learning is an umbrella term for statistical methods, usually involving artificial neural networks (ANNs), that provide estimators consisting of multiple stacked transformations (Schmidhuber, 2015; LeCun et al., 2015). Traditional estimators (i.e., "shallow" models, such as logistic regression or random forest classifiers) are limited in their ability to approximate nonlinear or otherwise complex objective functions, resulting in reduced performance, particularly on datasets where entities with similar characteristics are not cleanly separated by linear decision boundaries. DL estimators overcome this limitation through sequential nonlinearities: given sufficiently many hidden units, a non-polynomial activation function, and an appropriate learning algorithm, even structurally simple feed-forward neural networks with a finite number of neurons can approximate virtually any continuous function on compact subsets of Euclidean space (Leshno et al., 1993).
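To make the limitation concrete, the XOR function is a classic example of a target that no single linear decision boundary can represent, yet one hidden layer of two threshold units suffices. The sketch below is purely illustrative (the weights are hand-chosen for the example, not learned):

```python
def step(z):
    """Hard-threshold activation: a simple nonlinearity."""
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    """Two-layer network computing XOR, which no linear model can represent."""
    h1 = step(x1 + x2 - 0.5)    # fires if at least one input is on
    h2 = step(x1 + x2 - 1.5)    # fires only if both inputs are on
    return step(h1 - h2 - 0.5)  # 'at least one, but not both'

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# expected: [0, 1, 1, 0]
```

Removing the hidden layer leaves a single threshold unit, which cannot reproduce this truth table; stacking one nonlinear layer is already enough.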
Nonetheless, the successes of DL have been tempered by a number of important criticisms. Compared to shallow models, deep ANNs are substantially more complex to parameterize, due to the explosion of free parameters that results from increasing the depth of the network or the width of individual layers (Shi et al., 2016). Furthermore, DL models are notoriously challenging to interpret, since the features in a network's intermediate layers are combinations of features from all previous layers, which effectively obscures the intuitive meaning of individual feature weights in most nontrivial cases (Lipton, 2018; Lou et al., 2012).
It is also worth noting that DL model architectures can reach immense sizes. For example, the popular image classification network ResNet performed best in an early study when constructed with 110 convolutional layers, containing over 1.7 million tunable parameters (He et al., 2016). However, for standard binary classification on simple datasets, smaller DL architectures can still substantially outperform shallow learners, both in terms of error and training time (Auer et al., 2002; Collobert and Bengio, 2004). For the purpose of establishing a baseline comparison between AutoML and DL, we restrict our analyses in this study to this latter case of applications.

Automated Machine Learning
One of the most challenging aspects of designing an ML system is identifying the appropriate feature transformation, model architecture, and hyperparameterization for the task at hand. For example, count data may benefit from a square-root transformation. Similarly, a support-vector machine (SVM) might predict susceptibility to a complex genetic disease more accurately than a gradient boosting model trained on the same dataset. Further, different choices of that SVM's hyperparameters, such as the kernel function k and the soft-margin width C, can lead to wildly different performance. Traditionally, these design decisions have been made using prior experience, brute-force search, or experimenter intuition, each of which is undesirable for its own reasons.
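As an illustration of how strongly these choices matter, the following Scikit-learn sketch fits two SVMs that differ only in the kernel k and margin parameter C. The dataset and hyperparameter values are arbitrary stand-ins for illustration, not drawn from this study:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy dataset with a curved class boundary, standing in for a real task
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Two hyperparameterizations of the same SVM model family
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=10.0).fit(X, y)

print(linear_svm.score(X, y))  # a linear kernel struggles on curved boundaries
print(rbf_svm.score(X, y))     # an RBF kernel can fit the nonlinear structure
```

On data like these, the RBF-kernel SVM typically separates the classes far better than the linear one; AutoML automates exactly this kind of search over transformations, models, and hyperparameters.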
AutoML, on the other hand, provides methods for automatically selecting these options from a universe of possible architecture configurations. A number of different AutoML techniques can be used to find the best architecture for a given task; the one we focus on here is based on genetic programming (GP). Broadly, GP constructs trees of mathematical functions that are optimized with respect to a fitness metric, such as classification accuracy (Banzhaf et al., 1998). Each generation of trees is constructed via random mutations to the structure of the previous generation's trees or to the operations performed at their nodes. Repeating this process for a number of generations yields increasingly fit trees. As in natural evolution, more fit architectures are propagated forward while less fit architectures "die out".
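A minimal, heavily simplified sketch of this evolutionary loop is shown below, with a toy fitness function standing in for classification accuracy and random numeric perturbations standing in for structural tree mutations (all names are illustrative):

```python
import random

def fitness(candidate):
    """Toy stand-in for a pipeline's cross-validated accuracy:
    higher is better, with the optimum at candidate == 0."""
    return -abs(candidate)

def evolve(generations=50, population_size=20, seed=42):
    rng = random.Random(seed)
    # Initial random 'population' (real GP would use trees of operators)
    population = [rng.uniform(-10, 10) for _ in range(population_size)]
    for _ in range(generations):
        # Mutate: random perturbations stand in for tree mutations
        offspring = [c + rng.gauss(0, 1) for c in population]
        # Select: the fittest candidates survive; less fit ones 'die out'
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:population_size]
    return max(population, key=fitness)

best = evolve()
print(abs(best))  # converges toward the optimum at 0
```

Real GP-based AutoML differs mainly in the representation (trees of operators rather than numbers) and in using crossover as well as mutation, but the generate-score-select loop is the same.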
TPOT (the Tree-based Pipeline Optimization Tool) is a Python-based AutoML tool that uses genetic programming to identify optimal ML pipelines for either regression or classification on a given (labeled) dataset. Briefly, TPOT performs GP on trees whose nodes are operators, each of which falls into one of four types: preprocessors, decomposition functions, feature selectors, or estimators (i.e., classifiers and regressors). Input data enter the tree at leaf nodes, and predictions are output at the root node. Each operator has a number of free parameters that are optimized during the training process. TPOT maintains a balance between high performance and low model complexity using the NSGA-II Pareto optimization algorithm (Deb et al., 2002). As a result, the pipelines learned by TPOT consist of a relatively small number of operators (e.g., in the single digits) that can still meet or exceed the performance of competing state-of-the-art ML approaches.
In theory, TPOT can construct pipelines that are structurally equivalent to DL estimators by stacking multiple shallow estimators in a serial configuration. However, since the individual objectives for each operator are decoupled, it is unclear whether these configurations can attain the same performance as standalone DL estimators, such as multilayer perceptrons (MLPs).
With the exception of our new deep learning estimators (which are implemented in PyTorch), all operators are implemented in either Scikit-learn (Pedregosa et al., 2011) or XGBoost (Chen and Guestrin, 2016), both of which are popular open-source Python-based machine learning libraries. TPOT natively performs cross-validation and generates Python scripts that implement the learned pipelines. For a more detailed description and evaluations of TPOT, please see (Olson et al., 2016a,b; Le et al., 2020).

Results
We evaluated the pipelines' performance based on two metrics: classification accuracy of the trained pipeline and the elapsed time to train the pipeline. Each experiment can be grouped into 1 of 3 main configurations: NN (a single MLP classifier with no GP); TPOT (pipelines learned using GP, possibly containing multiple stacked estimators, preprocessors, and feature transformers); and TPOT-NN (the same as TPOT, but with added MLP and PyTorch logistic regression estimators). The contrived examples in Figure 1 illustrate the differences between pipelines learned from the 3 configurations.
In §5.2, we describe MLP architectures in this study as well as rationale for our specific design decisions.

Model performance comparison-NN, TPOT, and TPOT-NN
The prediction accuracy distributions of our experiments are shown in Figure 2. For each of the 6 datasets, neural network estimators alone yielded the lowest average prediction accuracy while the TPOT-NN pipelines performed best. In general, the TPOT-NN pipelines performed only marginally better than the standard TPOT pipelines. Notably, the (non-TPOT) neural network approach yielded substantially poorer performance on two of the datasets (Hill Valley with noise and Hill Valley without noise). We discuss a likely explanation in §3.2.

Validating TPOT-NN's neural network estimators
The performance advantages of DL estimators are largely derived from their ability to fit complex nonlinear objectives, which is a consequence of stacking multiple neural layers in a serial configuration. To confirm that the MLP estimator included in TPOT-NN leverages this advantage, we used the TPOT-NN API to design a logistic regression (LR) classifier, which is functionally equivalent to the MLP classifier with the intermediate ('hidden') layer removed: a shallow estimator implemented identically to the TPOT-NN MLP. We then ran a series of experiments testing each of these classifiers alone (i.e., only using TPOT to optimize model hyperparameters, and not to construct pipelines consisting of multiple operators). The results of these experiments are summarized in Figure 3. As a means for comparison, we visualized these results alongside experiments using the full TPOT-NN implementation, to confirm that the inclusion of all TPOT operators results in further improvements to classification accuracy.
The results of these experiments support this intuition: TPOT-NN's implementation of MLP classifiers performs better than the identically implemented LR classifier in most cases. Enabling all operators available to TPOT-NN results in an even more dramatic performance increase in addition to a remarkable improvement in the consistency of results on individual datasets. Together, these observations support the claims that (a) TPOT-NN's new estimators leverage the increased estimation power characteristic of deep neural network models, and (b) TPOT-NN provides an effective solution to improve on the results yielded by "naïve" DL models on simple classification tasks.

Training duration of TPOT-NN models
The total training time for TPOT pipelines ranged from 4h 22m to 8d 19h 55m, with a mean training time of 1d 0h 49m (± 1d 13h 28m). Table 1 and Figure 4 show how training times vary across the different experiments. To provide a basis of comparison for standard TPOT, we include training time for pipelines consisting of a single shallow estimator only, referred to as "Shallow (single estimator)", where TPOT uses GP solely for the purpose of optimizing hyperparameters on a pipeline containing a single estimator. Both configurations involving NN estimators required substantially more time to train a single pipeline on average: an increase of 629% for NN pipelines and 336% for TPOT-NN pipelines versus standard TPOT. Future studies on larger datasets should be performed to establish how this relationship scales with respect to training set size.

Structural topologies of pipelines learned by GP
TPOT assembles pipelines that consist of multiple operators (possibly including multiple classifiers or regressors in addition to feature selectors and feature transformers) to achieve better performance than individual machine learning estimators (Olson et al., 2016a). Since the estimation capacity of simple feedforward neural networks grows monotonically with network depth, we sought to determine whether TPOT can automatically construct deep architectures by stacking shallow estimators in the absence of a priori instruction to do so.
When TPOT-NN was forced to build pipelines comprised only of feature selectors, feature transformers, and logistic regression estimators, it did indeed construct pipelines consisting of stacked arrangements of logistic layers that strongly resemble well-known DL models. The following Python code is the output of one of these pipelines, selected at random from the pool of LR-only TPOT-NN pipelines (hyperparameters have been removed for readability):

    # Average CV score on the training set was: 0.9406477266781772
    exported_pipeline = make_pipeline(
        make_union(
            StackingEstimator(estimator=make_pipeline(
                StackingEstimator(estimator=PytorchLRClassifier(...)),  # LR1
                StackingEstimator(estimator=PytorchLRClassifier(...)),  # LR2
                PytorchLRClassifier(...)                                # LR3
            )),
            FunctionTransformer(copy)  # Identity (skip)
        ),
        PytorchLRClassifier(...)  # LR4
    )

The structure of this pipeline is virtually identical to a residual block, one of the major innovations behind the success of the ResNet architecture. A graphical representation of this pipeline is shown in Figure 5. This suggests that AutoML could be used as a tool for identifying new submodules for larger DL models. We discuss this possibility further in §3.3.
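For comparison, a ResNet-style residual block adds a skip branch to a stack of transformations, whereas the learned pipeline's make_union concatenates its identity branch as extra features. The additive form can be sketched in a few lines of NumPy (all names and weights here are illustrative, not taken from TPOT-NN):

```python
import numpy as np

def logistic_layer(x, w, b):
    """One logistic ('LR') transformation, analogous to a stacked LR operator."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def residual_block(x, layers):
    """Apply a stack of transformations, then add the untouched input back
    (the 'skip' branch), as in a ResNet residual block."""
    h = x
    for w, b in layers:
        h = logistic_layer(h, w, b)
    return h + x  # transformed branch + identity branch

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
layers = [(rng.normal(size=(3, 3)), np.zeros(3)) for _ in range(3)]
out = residual_block(x, layers)
print(out.shape)  # (4, 3)
```

The skip branch lets gradients and information bypass the stacked layers, which is the property that made very deep ResNets trainable.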

Assessing the tradeoff between model performance and training time
The amount of time needed to train a pipeline is an important pragmatic consideration in real-world applications of ML. This certainly extends to the case of AutoML: the parameters we use for TPOT include 100 training generations with a population size of 100 in each generation, meaning that we effectively evaluate 10,000 pipelines, each of which consists of a variable number of independently optimizable operators, for every experiment (of which there were 1,375 in the present study). As demonstrated in §2.3, we generally expect a 'standard' TPOT pipeline to train in the range of several hours to slightly over 1 day. Our relatively simple MLP implementation (see §5.2) sits at the lower end of available deep learning estimators in terms of complexity, and is likewise one of the fastest to train. Regardless, including the MLP estimator in a TPOT experiment increases the average time to learn a pipeline by at least 4-fold. Users will have to determine, on an individual basis and dependent on the use case, whether the potential accuracy increase of at most several percentage points is worth the additional time and computational investment.
Nonetheless, the data in Figure 2 demonstrate that it is unlikely for a TPOT-NN pipeline to perform worse than a (non-NN) TPOT pipeline. In 'mission critical' settings where training time is not a major concern, TPOT-NN can be expected to perform at least as well as standard TPOT.

AutoML is effective at recovering DL performance consistency
One of the most striking patterns in our results is that the NN-only models (both MLP and LR) yielded highly inconsistent performance on several datasets, especially the two "hill/valley" datasets (see Figure 2). However, this is unsurprising: these datasets contain sequence data, where an estimator must be able to identify a 'hill' or a 'valley' that could occur at any location in the sequence of 100 features. The decision boundary of an estimator over these data is therefore optimized over a highly non-convex objective function. Standard DL optimization algorithms struggle with problems like these, while heuristic and ensemble methods tend to perform consistently well, explaining the large difference in classification-accuracy variance between the NN-only and TPOT-NN experiments.

AutoML as a tool to discover novel DL architectures
Based on the results we describe in §2.4, AutoML (and TPOT-NN, in particular) may be useful for discovering new neural network "motifs" to be composed into larger networks. For example, by repeating the internal architecture shown in Figure 5 to a final depth of 152 hidden layers and adjusting the number of nodes in those layers, the result is virtually identical to the version of ResNet that won 1st place in 5 categories at two major image recognition competitions in 2015 (He et al., 2016). In the near future, we plan to investigate whether this phenomenon could be scaled into a larger, fully data-driven approach for generating modular neural network motifs that can be composed into models effective for a myriad of learning tasks.

Future work on integrating AutoML and DL
Since one of our primary goals in this work was to provide a baseline for future development of deep learning models in the context of AutoML, the two PyTorch models we have currently built (logistic regression and MLP) are structurally simple. Future work on TPOT-NN will allow expansion of its functionality to improve the capabilities of the existing models as well as incorporate other, more complex architectures, such as convolutional neural networks, recurrent neural networks, and deep learning regressors.

Conclusions
AutoML and DL are immensely useful tools for approaching a wide variety of inductive learning tasks, and it is clear that both hold strengths and weaknesses for specific use cases. Rather than viewing them as competing methods, we instead propose that the two can work synergistically: for at least the cases we explored in this study (classification on 6 well-characterized datasets with relatively simple feature correlations), the addition of multilayer perceptron classifiers into the pool of available operators improves the performance of AutoML. Since such learned pipelines often explicitly include feature selection and feature transformation operators, they provide a feasible mechanism for improving the interpretability of models that make use of DL.
Currently, use of these DL estimators in TPOT significantly increases training time for pipelines, which likely will limit their applications in many situations. Nonetheless, this suggests a multitude of novel directions for methodological research in machine learning and artificial intelligence. TPOT-NN serves as both an early case study as well as a platform to facilitate DL+AutoML research in a reproducible, transparent manner that is open to the scientific community.

Datasets and experimental setup
We evaluated TPOT-NN on 6 well-studied publicly available datasets that were used previously to evaluate TPOT's base implementation (i.e., no ANN estimators), shown in Table 3. All datasets are contained in the PMLB Python package. Hill Valley with noise and Hill Valley without noise consist of synthetic data; the rest are real data. The number of data points, data types for each feature (i.e., binary, integer, or floating-point decimal), number of features, and number of target classes are variable across the 6 datasets.

[Table 2, with columns "Parameter", "Options", and "Description", appears here in the original.]
Table 2: Configuration options used to construct the TPOT/TPOT-NN experiments in this study. The "included estimator" and "TPOT enabled" parameters were used for testing and validation of the TPOT-NN models.
We performed 720 TPOT experiments in total, corresponding to all combinations of the configuration parameters shown in Table 2. All configurations were run with 5 replicates on each of the 6 datasets listed in Table 3, resulting in 30 experiments per configuration. We used an 80%/20% train/test split on the datasets and scored pipelines based on classification accuracy with 5-fold cross-validation. We constructed logistic regression (LR) and multilayer perceptron (MLP) models in PyTorch to serve as neural network models. LR and MLP are largely considered the two simplest neural network architectures, and are therefore suitable for initial evaluation of new machine learning tools based on ANNs. Since a Scikit-learn LR model is included in standard TPOT, we are able to directly compare the two LR implementations to validate that the PyTorch models are compatible with the TPOT framework, and to quantify the performance variation due to differences in the internal implementations of equivalent models. To allow for a similar comparison for MLP, we merged Scikit-learn's MLP model into TPOT; it had previously been omitted from the allowable operators due to lengthy training times and inconsistent performance compared to MLPs constructed using dedicated deep learning libraries (such as PyTorch).
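The evaluation protocol above (an 80%/20% train/test split with 5-fold cross-validated accuracy) can be sketched with standard Scikit-learn utilities. The dataset and estimator below are illustrative stand-ins for the PMLB datasets and learned pipelines:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Illustrative stand-in for one of the benchmark datasets
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# 80%/20% train/test split, as in the experiments above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Score a candidate estimator by 5-fold cross-validated accuracy
clf = LogisticRegression(max_iter=1000)
cv_acc = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy").mean()

# Final held-out accuracy for the chosen model
test_acc = clf.fit(X_train, y_train).score(X_test, y_test)
```

Cross-validated accuracy on the training portion drives pipeline selection, while the held-out 20% provides the reported test score.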

TPOT-NN
TPOT users can control the set of available operators-as well as the trainable parameters and the values they can assume-by providing a 'configuration dictionary' (a default configuration dictionary is used if the user does not provide one). We coded the new NN estimators for TPOT within the main TPOT codebase, but provided a separate configuration dictionary that includes the NN estimators along with all default operators. We wrote the new TPOT-NN models in PyTorch, but the TPOT-NN API could be adapted to other neural computing frameworks. Since TPOT requires that all estimators implement an identical interface (compatible with Scikit-learn conventions for fitting a model and transforming data with the fit model), we wrapped the PyTorch models in classes that implement the necessary methods.
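Concretely, the wrapper pattern looks something like the sketch below, in which a plain NumPy gradient-descent logistic regression stands in for the PyTorch model. The class name and internals are illustrative, not TPOT's actual code:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class WrappedLRClassifier(BaseEstimator, ClassifierMixin):
    """Illustrative Scikit-learn-compatible wrapper; in TPOT-NN the bodies of
    fit/predict would delegate to a PyTorch model instead."""

    def __init__(self, lr=0.1, n_iter=500):
        self.lr = lr
        self.n_iter = n_iter

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.w_ = np.zeros(X.shape[1])
        self.b_ = 0.0
        for _ in range(self.n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ self.w_ + self.b_)))
            grad = p - y  # gradient of the logistic loss w.r.t. the logits
            self.w_ -= self.lr * (X.T @ grad) / len(y)
            self.b_ -= self.lr * grad.mean()
        return self  # Scikit-learn convention: fit returns self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return (X @ self.w_ + self.b_ > 0).astype(int)

clf = WrappedLRClassifier().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
print(clf.predict([[0.5], [2.5]]))  # expected: [0 1]
```

Because the wrapper subclasses BaseEstimator and exposes fit/predict, TPOT's GP machinery can mutate its hyperparameters and place it anywhere a Scikit-learn estimator could appear in a pipeline.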
Users can also direct TPOT to utilize the NN models by providing a 'template string' instead of a configuration dictionary. A generic template string for MLP might look like "Selector-Transformer-PytorchMLPClassifier", instructing TPOT to fit a series of 3 operators (and their associated hyperparameters): any feature selector, followed by any feature transformer, followed by an instance of the PyTorch MLP model. When no template string is used, TPOT has the ability to learn pipelines with more complex structures.

Hardware and high-performance computing environment
All experiments were run on a high-performance computing (HPC) cluster at the University of Pennsylvania. Each experiment was run on a compute node with 48 available CPU cores and 256 GB of RAM. Job scheduling was managed using IBM's Platform Load Sharing Facility (LSF). All experiments involving PyTorch neural network estimators were run on nodes equipped with NVIDIA® TITAN GPUs.

Code and Data Availability
TPOT-NN is a submodule included with the full TPOT Python distribution, which is freely available on the Python Package Index (PyPI) and through GitHub [REF]. Due to the substantially increased training time of the neural network models, users must explicitly enable the use of the TPOT-NN estimators by passing the parameter config='TPOT NN' when instantiating a TPOT pipeline. The code we used to evaluate TPOT-NN is available on GitHub in a separate repository [REF]. A frozen copy of all code, data, runtime output, and trained models is also available on FigShare [REF].