Introduction

The aeronautical sector is in continuous, fierce competition to shorten time-to-market cycles and to offer greener and more efficient aircraft. This is driven both by the aircraft manufacturers' aim to keep their industrial leadership and by the targets that governments impose on the sector. For instance, in Europe, the High Level Group on Aviation Research (HLGAR) provided several guidelines stated in the ACARE (Advisory Council for Aeronautics Research in Europe) Flightpath 2050 [1]. Advancing the design of better aircraft in reduced times requires introducing new tools and technologies into the design process. In particular, CFD simulations of complex configurations usually require several hours using hundreds of processors. In an earlier stage of the design process, computational fluid dynamics (CFD) simulations could be substituted by surrogate models able to predict the aerodynamic performance within reasonable precision margins. In the last few years, as will be detailed in the next section, there have been several applications of surrogate models to predict pressure curves or global aerodynamic coefficients, such as lift or drag, of aeronautical configurations. However, to our knowledge, there is no integrated comparison of several surrogate models, such as the one proposed in this paper, using the same aircraft configurations and databases, from which conclusions on the performance of each model can be extracted.

Therefore, the objective of this paper is to compare different surrogate regression models for aerodynamic coefficient prediction on different aeronautical configurations. In particular, three aeronautical configurations have been used: a NACA0012 airfoil, a RAE2822 airfoil and a 3D DPW wing from the AIAA (American Institute of Aeronautics and Astronautics) Drag Prediction Workshop.

The paper is structured as follows: “Brief review of the state-of-the-art” presents a review of the state-of-the-art, focusing on surrogate modelling applications for aerodynamic analysis and design. “Surrogate modelling approaches for regression” theoretically describes the methods to be compared. “Numerical results” shows the numerical results and, finally, “Conclusions and future research” presents the conclusions. The complete database information is provided as annexes at the end of this paper.

Brief review of the state-of-the-art

Generally speaking, surrogate modelling refers to a group of techniques that make use of previously sampled data to build surrogate models, which are subsequently used to predict the value of variables at new points in the design space. The use of machine learning methods to make predictions is not new in markets such as finance or insurance. For instance, more than 10 years ago, applications such as [2] explored the performance of credit scoring using two data mining techniques, classification and regression trees (CART) and multivariate adaptive regression splines (MARS). In [3], a comparative study is performed on the prediction performance of an artificial neural network (ANN) model against a linear prediction model, such as linear discriminant analysis (LDA), for forecasting corporate credit ratings from financial statement data. In addition, other scientific publications [4,5,6] focused on stock market prediction using ANNs.

In the aeronautical sector, there have also been applications of surrogate modelling techniques, mainly for aerodynamic analysis and optimization. Their application to the aerospace field and, particularly, to aerodynamic data prediction based on CFD, wind tunnel and flight testing data can allow a first-stage exploration of new areas in the design space without the need for expensive simulations, wind tunnel or flight testing and, in this way, reduce the number of required experiments. For instance, with respect to the so-called physics models, Kriging [7,8,9,10] and co-Kriging [11,12,13] based models have been applied to multi-objective optimization or multi-disciplinary optimization problems, also including uncertainty management and quantification as in [14, 15].

On the other hand, within the machine learning field, there have also been applications of models based on ANNs or support vector machines (SVMs) for aerodynamic coefficient prediction [16,17,18], aerodynamic design [19,20,21,22] and uncertainty quantification and/or robust design [23, 24]. Other supervised learning methods, such as Bayesian automatic relevance determination (ARD) regression or Bayesian ridge, have been applied in [25,26,27], mainly for aerodynamic design and optimization, and decision tree-based models have been used in [28, 29].

More recently, deep learning techniques have also been applied to wing airfoil pressure calibration in [30] and to a multi-fidelity surrogate-based robust optimization framework in [31]. In addition, approximation models based on convolutional neural networks (CNNs) have been proposed for flow field predictions in [32,33,34,35].

In summary, machine learning entails powerful information processing algorithms that are relevant for modelling, optimization, and control of fluids. Currently, machine learning capabilities are advancing at an incredible rate, and fluid mechanics is beginning to tap into the full potential of these powerful methods. Many tasks in fluid mechanics, such as reduced-order modelling, shape optimization, uncertainty quantification, and feedback control, may be posed as optimization and regression tasks. Machine learning can dramatically improve optimization performance and reduce convergence time. Machine learning is also used for dimensionality reduction, identifying low-dimensional manifolds and discrete flow regimes, which benefit understanding.

This explains why, in the last few years, there has been an increasing number of applications of surrogate models to predict pressure curves or global aerodynamic coefficients, such as lift or drag, of aeronautical configurations. However, to our knowledge, there is no integrated comparison of several surrogate models using the same aircraft configurations and databases, from which conclusions on the performance of each model can be extracted.

Therefore, the objective of this paper is to make a thorough comparison of different surrogate regression models for aerodynamic coefficient prediction on different aeronautical configurations. In particular, three different aeronautical configurations have been used: a NACA0012 airfoil, a RAE2822 airfoil and a 3D DPW wing.

The novelty of this work lies in the application of surrogate regression models to the prediction of aerodynamic coefficients of aeronautical configurations. Although the regression models applied in this paper already exist and have been applied in other sectors, such as finance or risk analysis, their application in aerodynamics is still in its infancy. The importance of this research resides in the high computational cost of computational fluid dynamics simulations. If this computational cost could be reduced using the proposed regression models, it would be possible to speed up the design process of new aircraft configurations and, moreover, to consider unconventional aircraft configurations, since it would become feasible to handle a large number of design parameters, which is not possible nowadays. However, the success of these techniques in the aeronautical sector and, in particular, in computational aerodynamics is still not clear and requires further research. If machine learning methods could be successfully used to substitute computational fluid dynamics tools for aerodynamic simulations, it would constitute a huge improvement for the aeronautical industry, which could use these methods to obtain fast predictions of the aerodynamic features of aircraft or their components and, therefore, speed up the time-to-market of their products. In addition, these methods also have great potential to exploit aerodynamic data already existing in industry, for instance from previous simulations, wind tunnel experiments or even flight testing. Therefore, it is worth investigating the feasibility of these methods for the aeronautical sector and, in particular, for aerodynamic prediction.

Surrogate modelling approaches for regression

This section theoretically describes the methods that will be compared in this paper.

Linear models

Linear models are among the simplest regression algorithms available to data scientists. They predict the target value through a linear relationship with the features, such that:

$$ \hat{y}\left( {\vec{w},\vec{x}} \right) = w_{0} + w_{1} x_{1} + w_{2} x_{2} + \cdots + w_{n} x_{n} , $$
(1)

where \(\hat{y}\) is the predicted value, \(x_{i}\) are the features of the dataset and \(w_{i}\) are the coefficients of the linear regression (i = 1, 2, …, n). All linear models share this characteristic; the differences appear in how the coefficients are obtained from the labelled data. In the following paragraphs, some of the most prevalent and widely used methods within this category are explained.

Least squares

This method involves fitting the training data to the model by minimizing the sum of the squares of the errors of the predictions with respect to the training labels, that is:

$$ \mathop {\min }\limits_{w} \left| {\left| {X\vec{w} - \vec{y}} \right|} \right|_{2}^{2} . $$
(2)

where \(X\) is the matrix built from the feature vectors, \(\vec{w}\) is the vector of coefficients and \(\vec{y}\) is the vector of training labels. Although simple, this objective function poses difficulties when the features of the data to fit are correlated. In this case, the design matrix becomes singular, or close to singular, and the model becomes highly sensitive to errors in the observed labels.
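As an illustration, the following minimal sketch (assuming the scikit-learn library; the toy Mach/AoA data are purely illustrative and are not taken from the databases of this paper) shows how such a least squares fit can be obtained in practice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix X (columns: Mach, AoA) and labels y (e.g. a lift coefficient);
# the values are illustrative only.
X = np.array([[0.3, 2.0], [0.3, 4.0], [0.5, 2.0], [0.5, 4.0], [0.7, 2.0], [0.7, 4.0]])
y = np.array([0.22, 0.44, 0.25, 0.50, 0.28, 0.55])

# Ordinary least squares: minimizes ||Xw - y||_2^2
ols = LinearRegression().fit(X, y)
print(ols.coef_, ols.intercept_)   # fitted w_1..w_n and w_0
print(ols.predict([[0.6, 3.0]]))   # prediction at a new design point
```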

Ridge regression

Ridge regression [36] caters to the problems encountered with multicollinearity in least squares. It does so by introducing a parameter \(\alpha\) that penalizes the squared \(l_{2}\) norm of the coefficients. The new objective function is:

$$ \mathop {\min }\limits_{w} (\left| {\left| {X\vec{w} - \vec{y}} \right|} \right|_{2}^{2} + \alpha \left| {\left| {\vec{w}} \right|} \right|_{2}^{2} ). $$
(3)

As the parameter \(\alpha\) increases, the resulting coefficients shrink, which augments the robustness against correlated features.
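A corresponding sketch with the ridge penalty (again using scikit-learn; the α value is illustrative and reuses the toy X and y defined in the least squares example above):

```python
from sklearn.linear_model import Ridge

# Larger alpha shrinks the coefficients more strongly, increasing
# robustness against correlated (collinear) features.
ridge = Ridge(alpha=1.0).fit(X, y)   # X, y as in the least squares sketch
print(ridge.coef_)
```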

LASSO (least absolute shrinkage and selection operator)

Least absolute shrinkage and selection operator [37] is, mathematically, a linear model with an added \(l_{1}\) regularization term, such that the objective function becomes:

$$ \mathop {\min }\limits_{w} \frac{1}{2n}\left| {\left| {X\vec{w} - \vec{y}} \right|} \right|_{2}^{2} + \alpha \left| {\left| {\vec{w}} \right|} \right|_{1} . $$
(4)

The \(l_{1}\) norm that appears in the objective function leads to fewer non-zero coefficients in the final model, automatically performing feature selection on the dataset in addition to regularization, as seen in ridge regression. Due to this feature selection strategy, LASSO gives further insight into the dataset compared to other similar models, which can be useful when, for instance, the dataset is small and there is model accuracy to be gained by removing non-relevant features to simplify the problem.
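The sparsity induced by the \(l_{1}\) penalty can be observed directly in the fitted coefficients; a brief sketch (illustrative α, toy data as above):

```python
from sklearn.linear_model import Lasso

# The l1 penalty drives some coefficients exactly to zero,
# effectively performing feature selection.
lasso = Lasso(alpha=0.05).fit(X, y)   # X, y as in the least squares sketch
print(lasso.coef_)                    # zero entries mark discarded features
```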

LARS (least angle regression)

Least angle regression [38] takes a different approach from the other linear methods covered here. As a linear model, it fits a linear combination of coefficients and features to a certain label; however, LARS does so using an iterative algorithm.

Initially, all the coefficients are set to zero, and an analysis is carried out to determine which feature is most correlated with the labels. Its coefficient is then increased along the slope given by this correlation until some other feature has as much correlation with the residual. At that point, both coefficients are increased in their joint least squares direction until a third feature is as correlated with the residual as the pair. This process is repeated until all the features are included.

LARS is efficient and fast to run on small, high-dimensional datasets; however, it is particularly sensitive to noise.
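A brief sketch of LARS with scikit-learn (toy data as above); the coefficient path records how the coefficients grow as features enter the model:

```python
from sklearn.linear_model import Lars

# Features enter the model one at a time, in order of their
# correlation with the residual.
lars = Lars().fit(X, y)   # X, y as in the least squares sketch
print(lars.coef_path_)    # coefficient values at each step of the algorithm
```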

Bayesian ridge

Bayesian regression [39], contrary to least squares for instance, does not assume that there is a single optimal set of coefficients satisfying the linear relationship; instead, it yields a posterior distribution of the model parameters. In this way, a priori knowledge of the coefficients can be included in the model to build a better estimator.

Bayesian ridge is a Bayesian implementation of the ridge model described in this section. For this model, the prior over the coefficients \(w\) is given by a spherical Gaussian:

$$ p\left( {w|\lambda } \right) = {\mathcal{N}}(w|0, \lambda^{ - 1} I_{{\text{p}}} ), $$
(5)

where the priors over \(\lambda\) (and over the noise precision) are chosen as gamma distributions. Bayesian ridge models tend to produce very similar results to ridge models; however, they also tend to be more robust in cases when limited data is available. Moreover, this model can incorporate prior knowledge of the system and compute the uncertainty associated with its predictions.
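A brief sketch of Bayesian ridge with scikit-learn (toy data as above), illustrating the predictive uncertainty mentioned above:

```python
from sklearn.linear_model import BayesianRidge

bayes = BayesianRidge().fit(X, y)   # X, y as in the least squares sketch
# return_std=True exposes the standard deviation of the predictive distribution.
mean, std = bayes.predict([[0.6, 3.0]], return_std=True)
print(mean, std)
```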

Huber regression

Huber regression [40] implements a ridge-like model with the \(l_{2}\) norm regularization term but introduces a different loss for outliers. The objective function then becomes:

$$ \mathop {\min }\limits_{w,\sigma } \left( {\mathop \sum \limits_{i = 1}^{n} \left( {\sigma + H_{\epsilon } \left( {\frac{{X_{i} w - y_{i} }}{\sigma }} \right)\sigma } \right) + \alpha \left| {\left| w \right|} \right|_{2}^{2} } \right), $$
(6)

with

$$ H_{\epsilon } \left( z \right) = \left\{ {\begin{array}{l@{\quad}l} {z^{2} ,} & { \left| z \right| < \epsilon } \\ {2\epsilon \left| z \right| - \epsilon^{2} ,} & {{\text{otherwise}},} \\ \end{array} } \right. $$

where \(\sigma\) is a scaling parameter to be optimized as well.

Huber regression brings outlier tolerance to the ridge model by introducing a linear loss, instead of a quadratic one, to reduce the effect of the outliers. A sample is classified as an outlier if the absolute error associated with it is larger than the parameter \(\epsilon\); the smaller this parameter is, the more robust the model is to outliers.
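A brief sketch of Huber regression with scikit-learn (toy data as above; the ε and α values are illustrative):

```python
from sklearn.linear_model import HuberRegressor

# Samples whose scaled absolute error exceeds epsilon are treated as
# outliers and penalized linearly rather than quadratically.
huber = HuberRegressor(epsilon=1.35, alpha=0.0001).fit(X, y)   # X, y as above
print(huber.outliers_)   # boolean mask of samples flagged as outliers
```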

ARD (automatic relevance determination) regression

Automatic relevance determination regression [39] is similar to Bayesian ridge regression, with some modifications to the prior introduced for that model. For ARD, the assumption of \(w\) as a spherical Gaussian is dropped; instead, an axis-parallel, elliptical Gaussian distribution is assumed. Then

$$ p\left( {w|\lambda } \right) = {\mathcal{N}}(w|0,A^{ - 1} ), $$
(7)

where \(A\) is a diagonal matrix with entries \(\lambda_{i}\); therefore, every coefficient \(w_{i}\) has its own standard deviation, contrary to Bayesian ridge. In practice, ARD regression leads to sparser coefficients than Bayesian ridge.
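A brief sketch of ARD regression with scikit-learn (toy data as above), showing the per-coefficient precisions that distinguish it from Bayesian ridge:

```python
from sklearn.linear_model import ARDRegression

ard = ARDRegression().fit(X, y)   # X, y as in the least squares sketch
print(ard.coef_)                  # typically sparser than Bayesian ridge
print(ard.lambda_)                # one precision (inverse variance) per coefficient
```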

Decision trees

Decision trees [41] are supervised learning methods that attempt to fit the data by devising a set of decision rules on its features. They can be used for both classification and regression problems, modifying only the output data type.

Intuitively, they can be thought of as a series of splits on the features of the dataset. For instance, when classifying a vehicle dataset based on the number of wheels, an initial split may be: if \(\# {\text{Wheels}} < 3 \to {\text{Motorcycle}}\). On the else branch of this rule, one may include \(\# {\text{Wheels}} < 5 \to {\text{Car}}\) and, finally, anything else may be a truck. The same logic can be applied to regression by assigning numerical values instead of classes to each decision.

One of the main parameters associated with decision trees is the maximum depth, which represents the maximum number of consecutive decisions allowed. This parameter is highly dependent on the size of the dataset and the system complexity. It is important to reach an appropriate depth for the problem at hand: too shallow a tree yields unacceptably low accuracy, but it is also easy to overfit the model by introducing too many splits.

Overfitting is of great concern when working with decision trees. To avoid this issue, a tuning process is usually applied to find the best depth for the model. There are alternative strategies to increase accuracy while avoiding overfitting; these typically fall within the scope of ensemble learning methods. Ensemble methods employ a combination of models so that the combined output is better than that of any single model by itself, reducing the variance. These methods, when applied to decision trees, are appropriately called random forests [42] because they combine several trees.

To train a random forest, each tree is trained on a subset of the complete training set, using only a portion of the available features. The reason for using only a subset of the features in any given tree is to ensure enough variability between the trees, preventing them from focusing too much on any single feature in the data. For regression, the outputs of the individual trees are averaged to produce the forest output; for classification, a majority voting process is used.

Finally, to introduce even more randomness in the training process and further reduce variance, extremely randomized trees, or Extra trees [43], can be trained. These differ from random forests in the training process of each tree. Typically, the split point is chosen as the most discriminative one; however, in an Extra trees model, the splitting rule is chosen as the best from a set of randomly generated thresholds. This typically reduces variance at the cost of a slight increase in bias.
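A brief sketch of the three tree-based regressors discussed above, using scikit-learn (the depth, number of trees and feature fraction are illustrative hyperparameters, not the values used later in this paper):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# max_depth limits the number of consecutive splits; n_estimators is the number
# of trees in the ensemble; max_features is the fraction of features per split.
tree = DecisionTreeRegressor(max_depth=5).fit(X, y)   # X, y as in earlier sketches
forest = RandomForestRegressor(n_estimators=100, max_features=0.5).fit(X, y)
extra = ExtraTreesRegressor(n_estimators=100).fit(X, y)

for model in (tree, forest, extra):
    print(model.predict([[0.6, 3.0]]))   # prediction at a new design point
```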

Neural networks

Neural networks have been one of the hot topics in the machine learning community in recent years and will probably remain so for the foreseeable future. They were inspired by biological neural networks.

The basic unit within a neural network [44] is the so-called neuron. Neurons are very simple units characterized by a set of weights, with size equal to the number of inputs to the neuron; during execution, the output of the neuron is the linear combination of the weights and the inputs, usually passed through a nonlinear activation function. One neuron by itself is not a very powerful estimator; however, the power of neural networks comes from combining many of these simple units, creating complex relationships that can extend to very difficult classification or regression problems.

Although there are many configurations for networks of neurons, it is outside the scope of this paper to consider configurations other than the simple dense network. In this type of network, the neurons are organized in what are known as layers. The first layer, known as the input layer, receives the features of a sample, and each neuron computes the linear combination of its weights with the feature set. This result is passed on to the next layer of the network, for an arbitrary number of layers, until the last layer, or output layer, is reached.

Each layer may apply what is known as an activation function to its outputs; in classification problems, the output layer uses such a function to produce categorical data from a real-valued output. Common examples of these functions are the sigmoid and tanh.

To train such a network, one must define a cost function, such as the squared error (many others can be used; the cost function is one of the hyperparameters of the model), between the output and the training label. The objective, then, is to generate a network that reduces this error. To that end, one can compute the gradient of the cost function with respect to each weight in the network, and an optimization algorithm can use these gradients to reduce the error.

Finding the weights that minimize the error is not trivial, and there is a wide array of methods and techniques to obtain the best possible performance, from regularization to initialization algorithms to different optimizers. It becomes apparent that there is a large number of hyperparameters that must be chosen to yield the best possible output.

All in all, neural networks provide a lot of flexibility and complexity compared with other machine learning models. However, they are usually more time-consuming and require more expertise from the user to provide the best results.
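A brief sketch of a dense network for regression, using scikit-learn's multi-layer perceptron (the layer sizes, activation and learning rate are illustrative hyperparameters; inputs are scaled first, as is usual for neural networks):

```python
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Two hidden layers with ReLU activations; training minimizes the squared
# error by gradient-based optimization of the weights.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                 learning_rate_init=1e-3, max_iter=2000, random_state=0),
)
mlp.fit(X, y)                       # X, y as in the least squares sketch
print(mlp.predict([[0.6, 3.0]]))
```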

Support vector machines for regression

Support vector machines for regression (SVMr) are a powerful tool in the machine learning field and are used as a modelling tool in many regression problems in engineering. The SVMr can be solved as a convex optimization problem, using kernel theory to deal with nonlinear problems. The SVMr considers not only the prediction error but also the generalization of the model. Training an SVMr consists of fitting a model of the form \(y = w^{{\text{T}}} \Phi \left( x \right) + b\), given a set of training pairs \( C = \left\{ {\left( {x_{i} ,y_{i} } \right),i = 1,2, \ldots ,l} \right\}\), so as to minimize a general risk function of the form:

$$ R\left[ f \right] = \frac{1}{2}\|w \|^{2} + \frac{1}{2}C\mathop \sum \limits_{i = 1}^{l} L\left( {y_{i} ,f\left( x \right)} \right), $$
(8)

where w controls the smoothness of the model, \(\Phi \left( x \right)\) is a projection function from the input space to the feature space, b is a bias parameter, \(x_{i}\) is a feature vector of the input space with dimension \(N\), \(y_{i}\) is the output value to be estimated and \(L\left( {y_{i} ,f\left( x \right)} \right)\) is the selected loss function. In this paper, the L1-SVR (L1 support vector regression) is used, characterized by an ε-insensitive loss function:

$$ L\left( {y_{i} ,f\left( x \right)} \right) = \left| {y_{i} - f\left( {x_{i} } \right)} \right|_{\varepsilon } . $$
(9)

To train this model, the following optimization problem has to be solved

$$ \min \left( {\frac{1}{2}\|w \|^{2} + \frac{1}{2}C\mathop \sum \limits_{i = 1}^{l} \left( {\xi_{i} + \xi_{i}^{*} } \right)} \right), $$
(10)

subject to:

$$ \begin{aligned} &y_{i} - w^{{\text{T}}} \Phi \left( x \right) - b \le \varepsilon + \xi_{i} ,\quad i = 1, \ldots ,l \\ &- y_{i} + w^{{\text{T}}} \Phi \left( x \right) + b \le \varepsilon + \xi_{i}^{*} ,\quad i = 1, \ldots ,l \\ &\xi_{i} ,\xi_{i}^{*} \ge 0,\quad i = 1, \ldots ,l. \\ \end{aligned} $$
(11)

To do this, a dual form is usually applied, obtained from the minimization of the Lagrange function that combines the objective function and the constraints. The dual form is:

$$ \max \left( { - \frac{1}{2}\mathop \sum \limits_{i,j = 1}^{l} \left( {\alpha_{i} - \alpha_{i}^{*} } \right)\left( {\alpha_{j} - \alpha_{j}^{*} } \right)K\left( {x_{i} ,x_{j} } \right) - \varepsilon \mathop \sum \limits_{i = 1}^{l} \left( {\alpha_{i} + \alpha_{i}^{*} } \right) + \mathop \sum \limits_{i = 1}^{l} y_{i} \left( {\alpha_{i} - \alpha_{i}^{*} } \right)} \right) $$
(12)

subject to

$$ \mathop \sum \limits_{i = 1}^{l} \left( {\alpha_{i} - \alpha_{i}^{*} } \right) = 0; \quad \alpha_{i} ,\alpha_{i}^{*} \in \left[ {0,C} \right]. $$
(13)

In addition to these constraints, the Karush–Kuhn–Tucker conditions must also be taken into account to obtain the bias value. In the dual formulation, it is important to emphasize the appearance of the kernel function \({ }K\left( {x_{i} ,x_{j} } \right)\), which is equivalent to the scalar product \({ }\langle \Phi \left( {x_{i} { }} \right),\Phi \left( {x_{j} { }} \right)\rangle\). In our case, the kernel function is a Gaussian function:

$$ K\left( {x_{i} ,x_{j} } \right) = \exp \left( { - \gamma \cdot \| x_{i} - x_{j}\|^{2} } \right). $$
(14)

The final form of the regression model depends on the Lagrange multipliers \({ }\alpha_{i} ,\alpha_{i}^{*}\), following the expression:

$$ f\left( x \right) = \mathop \sum \limits_{i = 1}^{l} \left( {\alpha_{i} - \alpha_{i}^{*} } \right)K\left( {x_{i} ,x} \right) + b. $$
(15)

In this way, the SVMr model depends on three parameters: \(\varepsilon\), \(C\) and \(\gamma\). \(\varepsilon\) controls the error margin permitted for the model, as can be seen in Eqs. (10) and (11); C controls the number of outliers allowed in the optimization of Eq. (10); finally, \(\gamma\) determines the Gaussian variance of the kernel. Depending on the selection of these values, the model can have a different performance. To obtain the best SVM performance, a search for the most suitable combination of these three parameters must be carried out, usually using cross-validation techniques over the training set. To reduce the computational time of this process, different methods have been proposed in the literature to reduce the search space related to these parameters. In this case, the method developed by Ortiz-García et al. [45] has been applied, which has proven to require rather short search times.
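As an illustration of the parameter search described above, the following sketch uses scikit-learn's SVR with a Gaussian (RBF) kernel and a plain cross-validated grid search; the synthetic data and the grids of C, ε and γ are illustrative and do not reproduce the search space reduction of [45]:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Synthetic stand-in database: columns are Mach and AoA, label mimics a coefficient.
rng = np.random.default_rng(0)
X_train = rng.uniform([0.1, 0.0], [0.9, 20.0], size=(150, 2))
y_train = np.sin(X_train[:, 1] / 10.0) + 0.1 * X_train[:, 0]

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])
grid = {"svr__C": [1, 10, 100],
        "svr__epsilon": [0.001, 0.01, 0.1],
        "svr__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2").fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```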

Numerical results

Three test cases have been considered, based on three well-known geometries: the NACA0012 airfoil [46], the RAE2822 airfoil [47] and the DPW-W1 wing from the 3rd AIAA Drag Prediction Workshop [48] (Fig. 1).

Fig. 1 NACA0012, RAE2822 and DPW grids

Information about the aerodynamic databases

CFD data generation

CFD computations for the database generation were performed with the DLR TAU code [49], and the grids were generated with Centaur [50]. The TAU-Code solves the compressible, three-dimensional Reynolds-averaged Navier–Stokes equations using a finite volume formulation. The TAU-Code is based on a hybrid unstructured-grid approach, which makes use of the advantages of semi-structured prismatic grids in the viscous shear layers near walls and of the flexibility in grid generation offered by tetrahedral grids in the surrounding flow volume. A dual-grid approach with an edge-based data structure is used to make the flow solver independent of the cell types used in the initial grid.

The TAU-Code consists of several different modules, including:

  • the grid partitioner, which splits the primary grid into n subgrids for n processors;

  • the preprocessor module, which uses the information from the initial grid to create a dual grid and, additionally, coarser grids for multi-grid;

  • the solver module, which performs the flow calculations on the dual grid;

  • the postprocessing module, which is used to convert results to formats usable by popular visualization tools.

Together, all modules are available with Python interfaces for computing complex applications, e.g. unsteady cases, complete force polar curves or fluid–structure couplings, in an automatic framework. Furthermore, this eases the usage of massively parallel computers to execute applications.

NACA0012 database

This database contains 185 samples and was generated by LHS (Latin hypercube sampling) with Mach varying from 0.1 to 0.9 (incrementing 0.1 each time) and AoA (angle of attack) varying from 0 to 20 (incrementing 1 each time). The Mach number is a dimensionless quantity in fluid dynamics representing the ratio of the flow velocity past a boundary to the local speed of sound. The angle of attack (AoA) is the angle between the oncoming air, or relative wind, and a reference line on the airplane or wing. It is also important to mention that, if all 185 samples had run correctly with the CFD tool used for generating the database, there would be no empty spaces in the plane. The missing points are due to simulations that did not converge with the CFD tool, meaning that the lift or drag coefficient did not reach a stable value after a considerable number of iterations of the flow solver.

The following figures show the Mach versus AoA distribution of the database samples and the lift and drag coefficient curves of the database samples. The lift coefficient (Cl) is a dimensionless quantity that expresses the ratio of the lift force to the force produced by the dynamic pressure times the area. The drag coefficient (Cd) is also a dimensionless quantity, used to quantify the drag or resistance of an object in a fluid environment, such as air or water. It is used in the drag equation, in which a lower drag coefficient indicates that the object will have less aerodynamic drag. The drag coefficient is always associated with a particular surface area (Figs. 2, 3).

Fig. 2 Exploring NACA0012 database (Mach versus AoA distribution of the database samples)

Fig. 3 Exploring NACA0012 database (left: representation of the database samples in the lift-AoA space, right: representation of the database samples in the drag-AoA space)

RAE2822 database

This database contains 122 samples and was generated by LHS with Mach varying from 0.1 to 0.9 (incrementing 0.1 each time) and AoA varying from 0 to 15 (incrementing 1 each time). It is also important to mention that, if all 122 samples had run correctly with the CFD tool used for generating the database, there would be no empty spaces in the plane. The missing points are due to simulations that did not converge with the CFD tool, meaning that the lift or drag coefficient did not reach a stable value after a considerable number of iterations of the flow solver.

The following figures show the Mach versus AoA distribution of the database samples and the lift and drag coefficient curves of the database samples (Figs. 4, 5).

Fig. 4 Exploring RAE2822 database (Mach versus AoA distribution of the database samples)

Fig. 5 Exploring RAE2822 database (left: representation of the database samples in the lift-AoA space, right: representation of the database samples in the drag-AoA space)

DPW database

This database contains 100 samples and was generated by LHS with Mach varying from 0.1 to 0.8 (incrementing 0.1 each time) and AoA varying from 0 to 15 (incrementing 1 each time). Again, it is important to mention that, if all 100 samples had run correctly with the CFD tool used for generating the database, there would be no empty spaces in the plane. The missing points are due to simulations that did not converge with the CFD tool, meaning that the lift or drag coefficient did not reach a stable value after a considerable number of iterations of the flow solver. In this case, it can also be observed that, for the highest Mach numbers and AoA values, the probability of the solver diverging is higher, due to the instabilities expected in the flow field.

The following figures show the Mach versus AoA distribution of the database samples and the lift and drag coefficient curves of the database samples (Figs. 6, 7).

Fig. 6 Exploring DPW database (Mach versus AoA distribution of the database samples)

Fig. 7 Exploring DPW database (left: representation of the database samples in the lift-AoA space, right: representation of the database samples in the drag-AoA space)

In addition, the following table shows some additional statistics on the databases. The count, mean, min, and max rows are self-explanatory. The std row shows the standard deviation (which measures how dispersed the values are). The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations falls (Table 1).

Table 1 Statistics of the aerodynamic databases

Information about the strategy followed

In this research, the following strategy was followed:

  1. First, 15 different regression models were selected to be compared on the same databases. In this step, the split between training and testing sets was done with a pure random sampling method, considering 80% of the initial samples for the training set and the remaining 20% for the test set. Standard scaling was applied to the databases prior to training the models.

  2. The four best performing models were then selected for further analysis and cross-validation, to make the prediction results more robust and less dependent on the initial dataset split (see the sketch after the figures below).

The following figures illustrate the main steps of this strategy (Figs. 8, 9).

Fig. 8 First step of the followed strategy (full models comparisons with training and testing split)

Fig. 9 Second step of the followed strategy (best models comparisons with cross-validation and parameters tuning)
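The two-step strategy described above could be implemented, for instance, as in the following sketch (scikit-learn is assumed; the 80/20 random split and the standard scaling follow the description above, while only two of the 15 candidate models and an illustrative parameter grid are shown):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for one database: features are Mach and AoA, label a coefficient.
rng = np.random.default_rng(1)
X = rng.uniform([0.1, 0.0], [0.9, 20.0], size=(150, 2))
y = np.sin(X[:, 1] / 10.0) + 0.1 * X[:, 0]

# Step 1: pure random 80/20 split, standard scaling, first comparison of the models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
candidates = {"SVR_rbf": SVR(kernel="rbf"), "Decision tree": DecisionTreeRegressor()}
for name, model in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)]).fit(X_train, y_train)
    print(name, pipe.score(X_test, y_test))   # R2 on the held-out 20%

# Step 2: cross-validation and grid search on the best performing models.
best = Pipeline([("scale", StandardScaler()), ("model", SVR(kernel="rbf"))])
grid = {"model__C": [1, 10, 100], "model__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(best, grid, cv=5, scoring="r2").fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```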

Information about the methods and the comparison metrics

With the aim of providing a broad comparison of existing regression methods, 15 different approaches have been selected. The details of these methods are displayed in Table 2 below.

Table 2 Brief description of the specific methods used in the comparison

The metrics that will be used for model comparison are described in Table 3 below.

Table 3 Brief description of the specific metrics used in the comparison
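All of these metrics are available in scikit-learn (which is assumed here); a short sketch of how they can be computed from a model's predictions (the values below are illustrative):

```python
import numpy as np
from sklearn.metrics import (r2_score, mean_absolute_error, mean_squared_error,
                             max_error, explained_variance_score)

y_true = np.array([0.21, 0.45, 0.30, 0.52])   # reference CFD values (illustrative)
y_pred = np.array([0.20, 0.47, 0.28, 0.55])   # surrogate model predictions

print("R2  ", r2_score(y_true, y_pred))
print("MAE ", mean_absolute_error(y_true, y_pred))
print("RMSE", mean_squared_error(y_true, y_pred) ** 0.5)
print("ME  ", max_error(y_true, y_pred))
print("EVS ", explained_variance_score(y_true, y_pred))
```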

Models comparison

In this section, the 15 selected methods have been applied to the three databases, and the comparison metrics detailed in the previous section have been computed.

NACA0012 database

From these results, the following conclusions can be drawn (Fig. 10):

  • The best model in terms of minimum MAE, RMSE and ME, and maximum R2 and EVS, is the SVR with a radial basis function kernel. This is the case for both Cl and Cd predictions. For this model, the RMSE is 0.084 for Cl and 0.013 for Cd, which is a reasonable accuracy.

  • Decision tree and Extra tree models also provide very good metric results.

  • The order of models in terms of performance remains almost invariable in all comparison metrics.

  • Linear regression models are not able to provide good results, as expected.

  • The kernel function used in the SVR model has a very strong influence on the metric results; for instance, the SVR with a polynomial kernel achieved an R2 of 0.68, compared to 0.95 for the SVR with a radial basis function kernel.

Fig. 10 Models comparison results on the NACA0012 database (left: Cl prediction, right: Cd prediction). Metrics are R2 (determination coefficient), MAE (mean absolute error), RMSE (root mean squared error), ME (max error) and EVS (explained variance score)

RAE2822 database

From these results, similar conclusions to those of the previous test case can be drawn (Fig. 11):

  • The best model in terms of minimum RMSE and ME, and maximum R2 and EVS, is the SVR with a radial basis function kernel. This is the case for both Cl and Cd predictions. For this model, the RMSE is 0.054 for Cl and 0.012 for Cd, which is a reasonable accuracy.

  • Decision tree, Extra tree and MLP_relu models also provide very good metric results.

  • The order of models in terms of performance remains almost invariable in all comparison metrics.

  • Linear regression models are not able to provide good results, as expected.

  • As also happened in the NACA0012 database, the kernel function used in the SVR model has a very strong influence on the metric results; for instance, the SVR with a polynomial kernel achieved an R2 of 0.60, compared to 0.98 for the SVR with a radial basis function kernel.

  • In the RMSE and ME metrics, there is a considerable difference between the two best performing models (SVR_rbf and Decision tree or Extra tree, depending on whether Cl or Cd is predicted).

Fig. 11 Models comparison results on the RAE2822 database (left: Cl prediction, right: Cd prediction). Metrics are R2 (determination coefficient), MAE (mean absolute error), RMSE (root mean squared error), ME (max error) and EVS (explained variance score)

DPW database

From these results, similar conclusions to those of the previous test cases can again be drawn (Fig. 12):

  • The best model in terms of minimum RMSE and ME, and maximum R2 and EVS, is the SVR with a radial basis function kernel. This is the case for both Cl and Cd predictions. For this model, the RMSE is 0.020 for Cl and 0.008 for Cd, which is a reasonable accuracy.

  • Decision tree and Extra tree models also provide very good metric results.

  • The order of models in terms of performance remains almost invariable in all comparison metrics.

Fig. 12 Models comparison results on the DPW database (left: Cl prediction, right: Cd prediction). Metrics are R2 (determination coefficient), MAE (mean absolute error), RMSE (root mean squared error), ME (max error) and EVS (explained variance score)

As additional information, the parameters of the four best performing models are displayed in Table 4 below. Note that, at this stage, no optimization of the model parameters has been performed.

Table 4 Initial model parameters

Model parameters optimization and cross-validation of the best four models

Now, with the four best performing models (SVR_rbf, Decision tree, Extra tree and MLP_relu), cross-validation was applied to ensure that the results are not affected by the training and testing dataset split, and the model parameters were optimized (using a grid search technique) to find out the maximum accuracy that can be obtained.

The following figures show the metric results for each of these models on the three databases.

NACA0012 database

As can be observed, SVR_rbf remains in the first position in all metrics; for instance, the R2 metric has now increased up to 0.99 for both Cl and Cd (previously it was 0.95 for Cl and 0.98 for Cd) (Fig. 13).

Fig. 13 Best 4 models comparison results on the NACA0012 database (left: Cl prediction, right: Cd prediction). Metrics are R2 (determination coefficient), MAE (mean absolute error), RMSE (root mean squared error), ME (max error) and EVS (explained variance score)

It is also important to note the strong performance increase of the MLP_relu model after the parameter optimization. For instance, in terms of R2, this model now achieves values of 0.99 for both coefficient predictions, compared to 0.84 for Cl and 0.93 for Cd before.

The final parameters used for the models can be observed in Table 5 below.

Table 5 Models parameters (NACA0012 database)

The following figures show the regression plots of these four models. Again, the outstanding behaviour of the SVR_rbf model is confirmed (Figs. 14, 15).

Fig. 14 Regression plots of the best four models for Cl prediction (NACA0012 database)

Fig. 15 Regression plots of the best four models for Cd prediction (NACA0012 database)

RAE2822 database

As can be observed, SVR_rbf and MLP_relu are in the first positions in all metrics. It is not possible to draw conclusions on which of these two models behaves better, since the performance varies when predicting Cl or Cd. In any case, the differences between these two models in terms of R2 are almost negligible (Fig. 16).

Fig. 16 Best four models comparison results on the RAE2822 database (left: Cl prediction, right: Cd prediction). Metrics are R2 (determination coefficient), MAE (mean absolute error), RMSE (root mean squared error), ME (max error) and EVS (explained variance score)

These results were obtained with the following model parameters (Table 6).

Table 6 Models parameters (RAE2822 database)

The following figures show the regression plots of these four models. Again, the outstanding behaviour of the SVR_rbf and MLP_relu models is confirmed (Figs. 17, 18).

Fig. 17 Regression plots of the best four models for Cl prediction (RAE2822 database)

Fig. 18 Regression plots of the best four models for Cd prediction (RAE2822 database)

DPW database

As can be observed, SVR_rbf remains in the first position in all metrics; for instance, the R2 metric has now increased up to 0.99 for both Cl and Cd (previously it was only 0.95 for Cd) (Fig. 19).

Fig. 19 Best four models comparison results on the DPW database (left: Cl prediction, right: Cd prediction). Metrics are R2 (determination coefficient), MAE (mean absolute error), RMSE (root mean squared error), ME (max error) and EVS (explained variance score)

It is also important to note the strong performance increase of the MLP_relu model after the parameter optimization. For instance, in terms of R2, this model now achieves values of 0.99 for Cl and 0.98 for Cd, compared to 0.92 for Cl and 0.13 for Cd before.

The final parameters used for the models can be observed in Table 7 below.

Table 7 Models parameters (DPW database)

The following figures show the regression plots of these four models. Again, the outstanding behaviour of the SVR_rbf model is confirmed (Figs. 20, 21).

Fig. 20 Regression plots of the best four models for Cl prediction (DPW database)

Fig. 21 Regression plots of the best four models for Cd prediction (DPW database)

Conclusions and future research

This paper has focused on a thorough comparison of different surrogate regression models for aerodynamic coefficient prediction on different aeronautical configurations. In particular, three different aeronautical configurations have been used: a NACA0012 airfoil, a RAE2822 airfoil and a 3D DPW wing.

From the obtained results, the following conclusions can be summarized:

  • The best models in terms of the analysed metrics are the SVR with a radial basis function kernel and a multi-layer perceptron neural network with the rectified linear unit (ReLU) as the activation function.

  • The superiority of the support vector regression model is justified by its better generalization performance compared with the other regression methods, which leads to better prediction accuracy. In addition, it is efficient for high-dimensional spaces and when the number of samples is limited, as happens in computational aerodynamics due to the computational cost of generating each sample of the training database. Moreover, the computational complexity of SVR does not depend on the dimensionality of the input space, which is also an advantage in this application field.

  • Model parameter optimization is crucial to obtain good accuracy, especially for the MLP_relu model, whose metric values changed drastically when optimizing the number of hidden layers, learning rate, etc.

  • The order of models in terms of performance remains almost invariable in all comparison metrics across the three databases studied.

  • Linear regression models are not able to provide good results, as expected.

  • The application of surrogate regression models for aerodynamic coefficients prediction is feasible and has a tremendous potential to reduce the computational time of CFD simulations, especially when considering aerodynamic design loops.

There is still further potential to be exploited: a smarter generation of the samples in the initial dataset (beyond LHS), the use of more robust model validation strategies such as k-fold cross-validation, the combination of multi-fidelity data within the aerodynamic database (e.g. CFD, wind tunnel, flight testing data, etc.), the comparison of additional regression models and the tuning of their parameters, etc. These issues will be undertaken in future work.

In addition, the use of deep learning techniques and their comparison against traditional machine learning techniques will be considered in the near future, since recent scientific publications have demonstrated their potential in other sectors. In particular, the application of deep learning methods to computational aerodynamics will be investigated over the coming years in the framework of a European project titled “Machine learning and data-driven approaches for aerodynamic analysis and uncertainty quantification” (acronym ML4AERO), with the collaboration of several research institutions (INTA, DLR, ONERA, CIRA, FOI, AIRBUS, OPTIMAD, IRT, INRIA and the University of Twente). There, the feasibility of applying deep learning methods and convolutional neural networks to aerodynamic analysis and design will be analysed.

Finally, it is important to mention that all databases used in this paper are freely available for the scientific community.