1 Introduction

In many areas of experimental research, particularly in fluid dynamics and combustion, extensive postprocessing is required to distill targeted information from raw imaging data. Segmentation and edge or object detection techniques are crucial for deriving important physical flow quantities from data obtained through imaging methods. Examples of features that may be sought include the coordinates of a dye plume (Yadav 2018), oil or smoke patterns (Arivoli 2023), the position of a shock wave (Kovács et al. 2023), vortical structures (Lindner et al. 2020) or particle clusters (Metzger et al. 2022), species concentrations and iso-surfaces (Zheng et al. 2022), and interfaces and zones (Reuther and Kähler 2018). The methodologies applied in this context span a range of experimental techniques (Tropea et al. 2007), including chemiluminescence (CL) (Guethe et al. 2012), laser induced fluorescence (LIF) (Eghtesad et al. 2024) and phosphorescence (LIP) (Charogiannis 2013), filtered Rayleigh scattering (FRS) (Doll et al. 2023), thermography (Astarita et al. 2006), schlieren imaging and shadowgraphy (Settles and Hargather 2017), high-speed photography (Versluis 2013), dye injection (Di et al. 2022), smoke visualization (Willmott et al. 1997), oil film measurements (Cai et al. 2022), as well as experiments involving shear- or temperature/pressure-sensitive liquid crystals (SLC, TLC) (Ireland and Jones 2000) and paints (TSP, PSP) (Gregory et al. 2008).

A good postprocessing algorithm for image segmentation must be accurate and computationally cheap, and should ideally require no user-based parameter tuning, so that a large set of images can be evaluated with the same settings and without human interference. As applications become more complex, conventional approaches based on brightness gradients in images often force these aspects to be traded off against one another.

Recently, Machine Learning (ML) has been increasingly integrated into problem-solving strategies in fluid dynamics (Brunton et al. 2020) and combustion (Zhou et al. 2022). Example applications relevant to combustion include data-driven physical modeling and estimation of properties (Eckart et al. 2022; González et al. 2020; Joo et al. 2015), reconstruction of scalar quantities and fields from measurement data (Barwey et al. 2022; Teutsch et al. 2023; Clark Di Leoni et al. 2023), object detection and classification (Roncancio et al. 2022; Ryu and Kwak 2021; Pulido et al. 2021), image segmentation (Vennemann and Rösgen 2020; Kuzu et al. 2022; Kashir et al. 2021), and real-time prediction of events that can be further used as inputs for control systems (Cellier et al. 2021; Li et al. 2022; Aliramezani et al. 2022). However, ML-based methods have not yet found their way into flow diagnostics as much as they have in other disciplines that deal with image processing techniques, such as robotic perception or biomedical imaging, although ML might offer significant benefits and could provide a valuable alternative to existing conventional techniques.

Fig. 1 Comparison of OH-PLIF image data quality under three different operating conditions. The images are false colored

Flame and flame front detection tasks have already been carried out with ML-based models for high-speed camera imaging. In the context of industrial burners and furnaces, Landgraf et al. (2023) and Groß et al. (2021) investigated segmentation of the whole flame brush for monitoring purposes. Sun et al. (2022) performed flame edge detection for premixed, diffusion and energetic material flames at atmospheric pressure in thermography, single-channel and RGB images with fairly good contrast to the background. In the context of combustion diagnostics research and for more demanding image conditions, there exist some applications of ML-based flame front detection and segmentation in optical SI-engines by Petrucci et al. (2022a, b) and Rufino et al. (2023) for pressures up to 0.38 MPa. However, to the knowledge of the authors, no ML-based application considering planar laser induced fluorescence images of the OH radical (OH-PLIF) and data recorded at high-pressure conditions exists in the literature. This is particularly relevant because the OH molecules are subject to collisional quenching at elevated pressures, leading to a loss of LIF signal; this significantly lowers the signal-to-noise ratio (SNR) of the image data, cf. Fig. 1, and thereby impedes the extraction of the flame fronts. Especially under low SNR conditions, conventional methods based on gradients of the OH signal exhibit deficiencies. Obtaining and evaluating accurate instantaneous flame fronts is of central importance for precisely deriving local quantities, such as flame curvature, to study flame dynamics and instability mechanisms. Hence, there is a need for more robust and reliable flame front extraction methods.

The aim of this paper is to demonstrate the possibilities of simple ML-based methods and to establish them as an alternative to advanced, highly tailored conventional flame front detection algorithms. For this purpose, the present work delves into the methodology for instantaneous flame front extraction from OH-PLIF images of turbulent premixed lean hydrogen-methane (H2–CH4) flames at elevated pressures. The experimental setup and data described in Faldella et al. (2023) were used. Further details on the employed setup and experimental conditions are summarized in the supplementary material for the interested reader. Firstly, established conventional extraction techniques and their limitations are discussed in Sect. 2, followed by an overview of ML-based approaches in Sect. 3. For the ML-based approaches, we motivate the powerful idea of convolutional filters in the context of neural networks and supervised learning. We then highlight different model categories and possible models that can be used for the task of extracting flame fronts. In Sect. 4, important implementation details are highlighted. The performance of a selection of ML-based flame front detection models and their behavior at low SNR are evaluated and put into perspective with conventional benchmark models in Sect. 5. Finally, in Sect. 6, the main findings and conclusions are summarized.

2 Conventional flame front detection

In principle, two pixel-based techniques can be distinguished for the extraction of structures in images: segmentation and edge detection. Segmentation is the process by which all pixels in the image are assigned to different classes and labeled accordingly, whereas edge detection refers to techniques that make use of gradients in images. For the extraction of flame fronts in OH-PLIF images, two methods have been established; they are referred to as the conventional methods in this manuscript. (1) The first approach (segmentation technique) aims to separate the zones of burnt from unburnt gases based on the intensity values of the OH signal in individual pixels. This is usually done by thresholding, with preceding preprocessing steps such as contrast enhancement and low-pass filtering to ensure the subsequent steps are not too sensitive toward noise. From there, the flame front can be extracted as the boundary between the two zones. The choice of the threshold has an essential impact. Often, thresholds are set manually and empirical values are used as starting points (Griebel et al. 2005). This may lead to uncertainties and cannot be executed unsupervised for significantly differing image regimes. The threshold can, however, also be selected statistically in a completely unsupervised manner using Otsu segmentation (Otsu 1979). This method is based on the histogram of all pixel intensity values in one image and maximizes the inter-class variance assuming a bimodal distribution. Adaptive or sliding-window thresholding (Bradley and Roth 2007) can be beneficial if strong spatial variations in illumination or intensity changes are present in the image. (2) The second approach (edge detection technique) makes use of a possible definition of the flame front which correlates the location of peak heat release with the location of peak OH-PLIF intensity gradients in a flamelet. The edge can be obtained from first-derivative-based filters which approximate the gradient, e.g. the Sobel, Prewitt, or Roberts operator, and the location of their peak. If second-derivative-based filters are applied, such as the LoG operator, the location of the zero crossing defines the edge (Yousaf et al. 2018). Canny edge detection (Canny 1986), for example, is a multi-step algorithm which makes use of derivative kernels to locate edges. The second approach (edge detection) is favored because, compared to the first (Otsu segmentation), it considers local information. Apart from the actual quantity of interest, the OH number density, the PLIF signal has a multi-dimensional dependency on pulse energy, upstream absorption, and thermodynamic conditions (Lacassagne et al. 2023). This is why global thresholding of intensities makes comparability rather challenging.
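To make the segmentation approach concrete, a minimal sketch using scikit-image is given below; the function name, the blurring parameter, and the use of find_boundaries are illustrative choices, not the exact implementation used in this work.

```python
import numpy as np
from skimage import filters, segmentation

def otsu_flame_front(image: np.ndarray, sigma_blur: float = 2.0) -> np.ndarray:
    """Binary mask of the boundary between burnt and unburnt zones."""
    smoothed = filters.gaussian(image, sigma=sigma_blur)  # low-pass filtering
    tau_otsu = filters.threshold_otsu(smoothed)           # unsupervised threshold
    burnt = smoothed > tau_otsu                           # burnt gas: high OH signal
    return segmentation.find_boundaries(burnt, mode='inner')
```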

Fig. 2 Flame front detection routine using the gradient method

A simple routine for the second approach (edge detection), referred to as the gradient method in this manuscript, might look as follows (depicted in Fig. 2a–c and e). In a first step, the retrieved OH-PLIF image is preprocessed by applying a low-pass filter, e.g. Gaussian blurring. Then the gradient is established with the \(3\times 3\) Sobel filter. Finally, to minimize erroneous predictions, a region of interest (ROI) is selected, corresponding to the averaged flame brush, in which the maximum of the gradient is extracted using the Canny method (Canny 1986), resulting in a binary mask which ought to correspond to the ground truth. The disadvantage of the gradient method is that two hysteresis threshold parameters, \(\tau _{\text {High}}\) and \(\tau _{\text {Low}}\), have to be chosen for the Canny edge detection, and that usually some sort of blurring, with parameter \(\sigma _{\text {Blur}}\) in the case of Gaussian blurring, has to be applied to cope with noise. The loss of information due to blurring mainly affects the small scales of the flame front, which are crucial for the rather challenging example of highly wrinkled turbulent premixed flames. Additionally, the operating conditions of the combustor have a strong influence on the image quality. In particular, as the pressure increases, the fluorescence yield decreases approximately in inverse proportion due to collisional quenching; thus, the signal-to-noise ratio (SNR) decreases significantly (Tu et al. 2020). This makes increased blurring a necessity and affects the quality and uncertainty of the extracted flame fronts derived from the raw data. The loss of information due to blurring should therefore be kept minimal. Furthermore, thermo-diffusive effects in the case of H2-blends lead to lower OH concentrations in concavely curved regions of the flame front (Bell et al. 2007). The OH gradients in these regions can possess magnitudes similar to those of non-flame-front structures, which makes the choice of the hysteresis thresholds extremely difficult. For these reasons, the processing has to be carried out in an image- or regime-specific manner under supervision, especially in the case of preferential- or thermo-diffusively unstable mixtures.

In an attempt to make the choice of these hysteresis thresholds more autonomous, they are sometimes set to \(\tau _{\text {High}}=\tau _{\text {Otsu}}\) and \(\tau _{\text {Low}}=\tau _{\text {Otsu}}/2\) (Setiawan et al. 2017). However, this does not always yield satisfactory results. A recently published paper reported the use of the Filtered Canny algorithm: Chaib et al. (2023) used the rough flame front contour obtained by an initial Otsu segmentation as ROI for Canny edge detection. Advanced preprocessing schemes such as nonlinear edge-preserving filters and contrast enhancement techniques were applied in the aforementioned work, and user-based parameter tuning was eliminated. Nevertheless, even without sophisticated preprocessing, the idea of a preliminary Otsu segmentation to obtain a streamlined and much smaller ROI for Canny edge detection, cf. Fig. 2d, enables reduced blurring and considerably simplifies the parameter selection for Canny edge detection. A minimal sketch of this idea is given below.
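The following sketch, again using scikit-image, outlines the Otsu-aided variant under the heuristics stated above (\(\tau _{\text {High}}=\tau _{\text {Otsu}}\), \(\tau _{\text {Low}}=\tau _{\text {Otsu}}/2\), thickened contour as ROI); parameter values and helper choices are illustrative.

```python
import numpy as np
from skimage import feature, filters, morphology, segmentation

def otsu_aided_gradient(image, sigma_blur=1.0, roi_halfwidth=8):
    smoothed = filters.gaussian(image, sigma=sigma_blur)   # low-pass filter
    tau_otsu = filters.threshold_otsu(smoothed)
    # Preliminary Otsu segmentation -> rough flame front contour as ROI
    contour = segmentation.find_boundaries(smoothed > tau_otsu)
    roi = morphology.binary_dilation(contour, morphology.disk(roi_halfwidth))
    # Canny edge detection with the Otsu-derived hysteresis thresholds
    edges = feature.canny(image, sigma=sigma_blur,
                          low_threshold=tau_otsu / 2, high_threshold=tau_otsu)
    return edges & roi    # keep only edges inside the streamlined ROI
```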

3 Deep learning-based flame front detection

3.1 Motivation for deep convolutional neural networks

Classical pixel-based image postprocessing most commonly involves some kind of filtering, whether for noise removal or for other specific tasks, e.g., the computation of a gradient. The convolution process is illustrated in Fig. 3 for the case of a \(3\times 3\) Sobel filter. Filters for different applications differ in their predefined weights but are applied in the same way. The limitations of conventional flame front detection methods due to filtering, mentioned in Sect. 2, raise the question of an alternative approach: does an optimal filter for a particular task exist, and how can it be found? Instead of manually designing and testing filters with predefined weights, the optimal weights of a convolutional filter can also be learned in a Neural Network (NN), i.e. in a Convolutional Neural Network (CNN), in a supervised way. A brief overview of the ML nomenclature used in the remainder of this manuscript can be found in Table 1. A simple exemplary architecture of a deep CNN, the VGG16 (Simonyan and Zisserman 2015), is shown in Fig. 4. Each block consists of several convolutional layers, in which a number of convolutional filters (cf. Fig. 3) are applied and followed by an activation function, as well as a pooling layer (cf. dimensions in Fig. 4 for an exemplary image of size \(256\times 512\) pixels). The resulting excellent ability to perceive underlying features has made CNNs, especially deep CNNs, an indispensable tool in modern image processing.

Fig. 3 Application of the horizontal \(3\times 3\) Sobel kernel (green) on an array (blue). The resulting value is stored in the position of the array shaded in orange. A \(3\times 3\) convolutional filter with undefined learnable weights is depicted in gray

Table 1 Machine Learning nomenclature in the context of flame front detection
Fig. 4 Schematic of the VGG16 backbone architecture (without fully connected layers). Layer dimensions are indicated for an exemplary image of size \(256\times 512\) pixels

The learning process in a NN always occurs in two steps: (1) inference or forward pass, and (2) backpropagation or backward pass. (1) During inference, a batch of samples from the dataset is passed through the network. All the weights according to the current training state are applied to obtain an output prediction, and the loss is computed as preparation for the backpropagation step. The choice of the loss function is highly problem specific and has a significant impact on the outcome. (2) Since this is a non-convex optimization problem, the loss is minimized iteratively with a gradient descent algorithm. Hence, in the second step, the gradient of the loss with respect to each weight is computed over all its connections from the output back to the respective layer, i.e. propagated back. These steps are repeated until a stopping criterion is met.
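A minimal sketch of these two steps in TensorFlow (the framework used later in Sect. 4), assuming a `model` and a `loss_fn` already exist:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(model, loss_fn, images, labels):
    # (1) Inference / forward pass: apply current weights, compute the loss
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_fn(labels, predictions)
    # (2) Backpropagation / backward pass: gradient of the loss w.r.t. every
    # trainable weight, followed by one gradient-descent (Adam) update
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```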

It is important to note that, due to the non-convexity of the problem, only local minima are found, determined by the learning rate and the starting point of gradient descent. Moreover, the solution is non-unique and NNs are not inherently invariant to transformations of their input. To address this, transfer learning and data augmentation are generally used, especially when dealing with sparse training data and an increased risk of overfitting. Adopting a pretrained network leads to a starting point in the loss landscape expected to be closer to the targeted minima, and therefore to a higher probability of obtaining a good solution and reduced computational effort for training. Data augmentation promotes network invariance, mitigates overspecialization on certain features, and artificially enlarges the dataset to a certain extent. Further concepts and the underlying mathematics can be found in, e.g., Murphy (2012).

3.2 Models for flame front detection

A thorough model overview in the general context of modern object segmentation, not limited to the semantic image segmentation applicable in our case, is given in Wang et al. (2022). As a counterpart, a solid review of edge detection technology is provided by Shou-Ming et al. (2021). This work focuses on CNN-based Deep Learning (DL) models in order to leverage their excellent abilities for image processing and the idea of learning optimal filters, as motivated in the previous subsection. Hence, traditional ML and spectral clustering methods were not considered in this work. In the following, the most important DL models relevant to flame front detection are presented and further options are illustrated. More details about the architectures can be found in the original sources provided. A distinction can be made between two categories: (1) the encoder-decoder type CNN, and (2) the multiscale fusion type CNN.

(1) In order to compute a loss, the output dimensions of the CNN need to match the label, which has the same spatial resolution as the input. Encoder-decoder type CNNs achieve this by attaching a decoder, usually mirroring the encoder backbone, to the encoder, connected via the latent space. A prime example of this architecture is the U-Net (Ronneberger et al. 2015), originally developed for semantic segmentation applications in biomedical imaging. This model additionally has skip connections (cf. gray arrows in Fig. 5), concatenating arrays from convolutional blocks of similar sizes between the encoder and decoder, in order to combine information from the down- and up-sizing branches of the model. Over time, improvements were made and more sophisticated models were developed, for example the Attention U-Net (Oktay et al. 2018), the Residual U-Net (Zhang et al. 2018), and the U-Net++ (Zhou et al. 2018), to name a few. More detailed illustrations and explanations are summarized in Siddique et al. (2021). LinkNet (Chaurasia and Culurciello 2017), on the other hand, was developed as an alternative with fewer model parameters in order to be computationally more efficient for applications in embedded systems. In contrast to the concatenation connections of the basic U-Net, LinkNet bypasses the input of each encoder layer directly and adds it to the output of the corresponding decoder layer. A toy example of the encoder-decoder idea is sketched below.
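The following toy model reduces the encoder-decoder idea to a single resolution level with one skip connection (cf. the gray arrows in Fig. 5); layer widths and the sigmoid output are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, n_filters):
    x = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(x)
    return layers.Conv2D(n_filters, 3, padding='same', activation='relu')(x)

inputs = tf.keras.Input(shape=(256, 512, 1))
enc = conv_block(inputs, 64)                            # encoder block
latent = conv_block(layers.MaxPooling2D(2)(enc), 128)   # latent space
up = layers.Conv2DTranspose(64, 2, strides=2, padding='same')(latent)
up = layers.Concatenate()([up, enc])                    # skip connection
dec = conv_block(up, 64)                                # decoder block
outputs = layers.Conv2D(1, 1, activation='sigmoid')(dec)  # soft prediction
unet = tf.keras.Model(inputs, outputs)
```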

(2) A different approach is chosen for the multiscale fusion type CNN. Here, a prime example is the Holistically-Nested Edge Detector (HED) (Xie and Tu 2015), schematically depicted in Fig. 6. The idea is to branch off arrays from each convolutional block and to enlarge them back to the original image size (side outputs). A loss is computed for each individual side output, leading to an approximation of the label in the corresponding layer of the HED. The output of the model (fuse-channel), for which a loss is also computed, consists of the fusion of the side outputs. The side outputs after the lower-resolution blocks only consist of very coarse features. When these coarse features are merged with the very detailed side outputs from the upper levels, a smearing effect acts on the model output (fuse-channel). Similarly to the U-Net, there are also further improved and related models of the multiscale fusion type, such as N\(^4\)-fields (Ganin and Lempitsky 2014), DeepContour (Shen et al. 2015), Richer Convolutional Features (RCF) (Liu et al. 2019), or DeepEdge (Bertasius et al. 2015). The side-output mechanism is sketched below.
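The side-output mechanism can be sketched in Keras as follows; the tapped layer names follow the Keras VGG16 application naming, while the \(1\times 1\) convolutions and bilinear upsampling are illustrative simplifications of the original HED.

```python
import tensorflow as tf
from tensorflow.keras import layers

backbone = tf.keras.applications.VGG16(include_top=False,
                                       input_shape=(256, 512, 3))
tap_layers = ['block1_conv2', 'block2_conv2', 'block3_conv3',
              'block4_conv3', 'block5_conv3']

side_outputs = []
for i, name in enumerate(tap_layers):
    # Branch off each block, reduce to one channel, enlarge to input size
    side = layers.Conv2D(1, 1)(backbone.get_layer(name).output)
    side = layers.UpSampling2D(2 ** i, interpolation='bilinear')(side)
    side_outputs.append(layers.Activation('sigmoid')(side))

# Fuse-channel: learned combination of all side outputs; during training a
# loss is computed for the fusion and for each side output individually
fuse = layers.Conv2D(1, 1, activation='sigmoid')(
    layers.Concatenate()(side_outputs))
hed = tf.keras.Model(backbone.input, side_outputs + [fuse])
```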

Furthermore, there are hybrid models that combine both ideas, that of an encoder-decoder and that of a fusion of outputs from multiple layers. These models usually run several streams or pyramids in parallel; examples are the HED-UNet (Heidler et al. 2022) and the Pyramid Scene Parsing Network (PSPNet) (Zhao et al. 2017). Also worth mentioning is a family of models for instance segmentation, which refers to the simultaneous detection of objects and their segmentation on the respective patch in the image. An example is the family of the Region-Based CNN (R-CNN) with its successors, such as the Mask R-CNN (He et al. 2017). Some of these models, however, e.g. the Feature Pyramid Network (FPN) (Lin et al. 2017), can also readily be used for semantic segmentation.

Fig. 5 Schematic of the basic U-Net using a VGG16 backbone

Fig. 6 Schematic of the basic HED using a VGG16 backbone

4 Implementations

4.1 Dataset, preprocessing and metrics

Labels are needed in order to train the DL models and to evaluate all models, including the conventional methods. In the ideal case, the labels correspond to the ground truth. For the task of flame front detection in OH-PLIF images, labels are not readily available, so they have to be created first. Thus, the images were annotated with the labeling tool Label Studio. The accuracy of the hand-labeling step depends on the performance of the human annotator; hand labeling is nevertheless generally accepted and common practice in ML, especially when done by multiple domain experts (Karimi et al. 2020). We assume that the created labels are a reasonable approximation of the ground truth when they fit the raw image and lie within the lines of the steepest gradients, using the convenient definition of the flame front which correlates the location of peak heat release with the location of peak OH-PLIF intensity gradients.

Table 2 Utilized dataset. Additional information is provided in the supplementary material
Table 3 Utilized evaluation metrics
Table 4 Parameter settings leading to the best possible IoU-scores for the implemented conventional methods serving as benchmarks

Producing a large number of good-quality labels is not very feasible, as it is a time-consuming process. For this work, the dataset consisted of a total of 60 annotated images; 10 images each were gathered at 6 different operating conditions. Further specifics can be found in Table 2. The following preprocessing steps were applied to the entire dataset of single-channel images (OH intensity signal): the background was subtracted, the images were corrected for the laser sheet inhomogeneity w.r.t. the vertical coordinate, and a crop was made to discard the irrelevant section of the images. The cropped data was resized to \(512\times 256\) pixels using bicubic interpolation in order to map all images to the same size. Then, min-max scaling was applied such that all 8-bit images possess integer values \(\in \{0, \ldots , 255\}\). The SNR was calculated according to Sweeney and Hochgreb (2009), using the formulation

$$\begin{aligned} \hbox {SNR} = \dfrac{\mu _{P}-\mu _{R}}{\sigma _{P}} , \end{aligned}$$
(1)

where \(\mu\) is the mean and \(\sigma\) the standard deviation of all pixel intensities measured on either the product side (subscript \(_P\)) or the reactant side (subscript \(_R\)). Since clearly separated unburnt and burnt zones are needed for this purpose, the images were binarized after edge detection with the Otsu-Aided Gradient Method described in Sect. 4.2. Mean SNR and standard deviations over the image set in a regime (\(\mu _{\text {Set}}\), \(\sigma _{\text {Set}}\)) are indicated in Table 2. A minimal sketch of this evaluation is given below.
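The sketch assumes a binary mask separating the product side (True) from the reactant side (False); names are illustrative.

```python
import numpy as np

def snr(image: np.ndarray, burnt_mask: np.ndarray) -> float:
    """SNR according to Eq. (1); `burnt_mask` is True on the product side."""
    mu_p = image[burnt_mask].mean()       # mean intensity, product side
    mu_r = image[~burnt_mask].mean()      # mean intensity, reactant side
    sigma_p = image[burnt_mask].std()     # intensity fluctuations, product side
    return (mu_p - mu_r) / sigma_p
```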

All metrics employed in this work are listed in Table 3. They are computed from the confusion matrix of hypothesis testing, where all predicted pixels are compared to the label and thus assigned as either False Positive (FP), True Positive (TP), False Negative (FN), or True Negative (TN). In the context of flame front detection, Positives (P) mark pixels predicted as flame front and Negatives (N) mark pixels predicted as background. True (T) and False (F) indicate a correct or incorrect identification of the pixels w.r.t. the ground truth, respectively. The metrics in Table 3 were chosen because most of them are independent of the majority class of TN in our highly unbalanced task, in which flame fronts occur much less frequently than the background class; the flame front class accounted for 2.122% of all pixels in the entire dataset. Note that the resulting metrics for the same flame can change depending on the cropped image region and the resolution.
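For reference, the following sketch computes the metrics of Table 3 from the pixel-wise confusion matrix; zero-division guards are omitted for brevity.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.sum(pred & label)       # flame front pixels correctly found
    fp = np.sum(pred & ~label)      # background predicted as flame front
    fn = np.sum(~pred & label)      # missed flame front pixels
    tn = np.sum(~pred & ~label)     # background correctly identified
    return {
        'IoU':       tp / (tp + fp + fn),
        'F1':        2 * tp / (2 * tp + fp + fn),
        'Precision': tp / (tp + fp),
        'Recall':    tp / (tp + fn),   # = TPR
        'FPR':       fp / (fp + tn),   # the only one involving the TN majority
    }
```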

4.2 Benchmark models

Conventional benchmark methods were implemented and evaluated to put the results of the DL models into perspective. The following three conventional methods, already roughly introduced in Sect. 2, were used: (1) the gradient method with manual selection of the hysteresis thresholds \(\tau _{\text {High}}\) and \(\tau _{\text {Low}}\) for Canny edge detection, referred to as the "Gradient Method (\(\tau\) Supervised)" in this manuscript. (2) As an alternative to the manual selection of these hysteresis thresholds, a gradient method where \(\tau _{\text {High}}\) and \(\tau _{\text {Low}}\) were set to \(\tau _{\text {Otsu}}\) and \(\tau _{\text {Otsu}}/2\), respectively, herein referred to as the "Gradient Method (\(\tau\) Unsupervised)". (3) As a third conventional method, the idea of a preliminary Otsu segmentation to obtain a streamlined ROI for Canny edge detection, herein referred to as the "Otsu-Aided Gradient Method", was implemented as a simplification of the Filtered Canny algorithm proposed by Chaib et al. (2023), excluding the applied preprocessing schemes. The flame front contour of the preliminary Otsu segmentation was thickened by 8 pixels. In all three benchmark methods, only blurring was used as preprocessing, as all other advanced preprocessing schemes can also be used for DL methods and are therefore not relevant to this comparison. The blurring parameter \(\sigma _{\text {Blur}}\) must be selected by the user in a supervised manner for all three conventional benchmark methods. These parameters (\(\tau _{\text {High}}\), \(\tau _{\text {Low}}\) and \(\sigma _{\text {Blur}}\)) only provide satisfactory results if they are determined specifically for each image and combustion regime. To highlight the best possible results, a grid search over all parameters was performed to find the optimal parameters in terms of IoU-score; a sketch is given below. This score is the main metric in the present work, as it was used in the loss function of the DL models, which were subsequently optimized for it. The grid search started with \(\sigma _{\text {Blur}}\) of the Gradient Method (\(\tau\) Unsupervised). The same blurring parameter was used for the Gradient Method (\(\tau\) Supervised), for which both hysteresis thresholds were optimized thereafter. For the Otsu-Aided Gradient Method, solely \(\sigma _{\text {Blur}}\) was optimized; the thresholds \(\tau _{\text {High}}\) and \(\tau _{\text {Low}}\) were set to \(\tau _{\text {Otsu}}\) and \(\tau _{\text {Otsu}}/2\). The optimized regime-specific parameters of all conventional methods are summarized in Table 4.
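A sketch of such a grid search is given below; the parameter ranges, and the `detect` and `iou` callables, are illustrative stand-ins for the actual implementations.

```python
import itertools
import numpy as np

def grid_search(images, labels, detect, iou,
                sigmas=np.linspace(0.5, 4.0, 8),
                taus_high=np.linspace(0.05, 0.4, 8)):
    """Return the (sigma_blur, tau_high) pair maximizing the mean IoU-score."""
    best_score, best_params = -1.0, None
    for sigma, tau_high in itertools.product(sigmas, taus_high):
        preds = [detect(im, sigma, tau_high, tau_high / 2) for im in images]
        score = np.mean([iou(p, l) for p, l in zip(preds, labels)])
        if score > best_score:
            best_score, best_params = score, (sigma, tau_high)
    return best_params, best_score
```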

As a further benchmark, and to assess the quality of annotation and segmentation by human eye and hand, the same set of images (Table 2) was labeled by a different person and subsequently evaluated. This result is presented in Sect. 5 as a point of reference together with the results of the other conventional methods.

Table 5 Model size and training parameters for the implemented DL models until convergence

4.3 Deep learning models

The simplest forms of the encoder-decoder and the multiscale fusion type were implemented, namely a U-Net and a HED, as well as the LinkNet and FPN as alternatives. Furthermore, the influence of several backbone architectures and of attention connections was tested on the U-Net. For a fair comparison independent of the platform’s performance, all implementations were made in Python (3.10.12) on Google Colab, with TensorFlow (2.15.0) and the Keras (2.15.0) application programming interface (API). The HED architecture was written from scratch, while the architectures of the U-Net with different backbones, the LinkNet and the FPN were built using the Keras library segmentation-models (1.0.1), and the Attention U-Net was built using keras-unet-collection (0.1.13). In the following, some important concepts are highlighted and the implications with respect to the implementations are discussed.

As usual in ML, the total dataset was divided into a training, validation and test set in order to monitor overfitting during training (training and validation set) and to perform unbiased evaluations with the final model (test set). These subsets were randomly drawn with a split ratio of 2/3, 1/6, and 1/6, respectively. To ensure adequate significance of the evaluations, care was taken that the different regimes and SNR conditions were reasonably represented in each subset.

The annotated dataset is not particularly large for reasons of feasibility. This is a significant disadvantage for the use of DNNs, which are very prone to overfitting. However, there are two very effective measures to counteract this data sparsity: (1) transfer learning and (2) data augmentation.

(1) In this work, pretrained weights were adopted from ImageNet (Deng et al. 2009). These weights were deployed on the VGG16 (Simonyan and Zisserman 2015), ResNet50 (He et al. 2016) and EfficientNet-B5 (Tan and Le 2019) backbones, respectively. Those versions of ResNet and EfficientNet were chosen for their similar numbers of trainable parameters, in order to investigate the effect of backbone complexity independently of an increasing number of parameters. A trainable convolution layer was added to the input of the networks in order to map the single-channel OH-PLIF image to the network's predefined 3-channel input. All activation functions were retained, except for a softmax that was added to the final layer. The weights of the encoder of all U-Nets and the LinkNet, as well as the first pyramid of the FPN, were frozen in order to save computational effort during training and to maximize the impact of the pretrained latent space on the segmentation output.
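A sketch of this setup using the segmentation-models API is given below; the sigmoid output is a simplification of the softmax used in this work, and the \(1\times 1\) mapping convolution is one possible realization of the added input layer.

```python
import os
os.environ['SM_FRAMEWORK'] = 'tf.keras'   # use the tf.keras backend
import segmentation_models as sm
import tensorflow as tf

# Pretrained U-Net with frozen VGG16 encoder (ImageNet weights)
base = sm.Unet('vgg16', input_shape=(256, 512, 3), classes=1,
               activation='sigmoid', encoder_weights='imagenet',
               encoder_freeze=True)

# Trainable convolution mapping the single-channel OH-PLIF image to the
# pretrained 3-channel input (illustrative realization)
single_channel = tf.keras.Input(shape=(256, 512, 1))
mapped = tf.keras.layers.Conv2D(3, kernel_size=1)(single_channel)
model = tf.keras.Model(single_channel, base(mapped))
```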

(2) In this work, random combinations of translations in integer pixel steps, mirroring around both main axes, and rotations in 90-degree steps were applied. Other transformations such as scaling or shearing were omitted, as they are not pixel-matching, in order to avoid errors resulting from the interpolation and subsequent rounding to \(\in \{0,1\}\) needed to obtain the augmented label. It was found that such transformations have a widening or fragmenting effect on the flame front label, which significantly reduced the model performance. In addition to the artificial increase in dataset size, data augmentation also helped ensure that the network did not specialize on background features, such as the inlet of the combustor and the anchor points of the flame, but judged purely based on visual information.
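A sketch of such a pixel-matching augmentation is given below; for the non-square \(256\times 512\) frames, rotations are restricted to 180° here to preserve the array shape, and the cyclic shift is a simplified stand-in for translation with padding.

```python
import numpy as np

def augment(image: np.ndarray, label: np.ndarray, rng: np.random.Generator):
    # Rotation: 0 or 180 degrees so non-square frames keep their shape
    # (odd multiples of 90 degrees would swap the axes)
    if rng.random() < 0.5:
        image, label = np.rot90(image, 2), np.rot90(label, 2)
    # Mirroring around both main axes
    if rng.random() < 0.5:
        image, label = np.flipud(image), np.flipud(label)
    if rng.random() < 0.5:
        image, label = np.fliplr(image), np.fliplr(label)
    # Translation in integer pixel steps; no interpolation touches the label
    dy, dx = rng.integers(-20, 21, size=2)
    image = np.roll(image, (dy, dx), axis=(0, 1))
    label = np.roll(label, (dy, dx), axis=(0, 1))
    return image, label
```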

For the loss function, Binary Cross-Entropy (BCE), Weighted Cross-Entropy, Focal loss, Dice loss (based on the F1-score), Jaccard loss (based on the IoU-score), Tversky loss, and all combinations thereof were tested. The mathematical formulations of these loss functions can be found, e.g., in Xu et al. (2023). The BCE-Jaccard loss, cf. Eq. 2, was found to perform best and was thus subsequently used to train all models described in Table 5. It can be written as

$$\begin{aligned} {\mathscr {L}_{\text {BCE-Jaccard}} = (1-\alpha ) \mathscr {L}_{\text {BCE}} + \alpha \mathscr {L}_{\text {Jaccard}},} \end{aligned}$$
(2)

where

$$\begin{aligned}\mathscr {L}_{\text {BCE}}&= - \frac{1}{N} \sum _{i=1}^{N} \left( y_i \log ({\hat{p}}_i) + (1 - y_i) \log (1 - {\hat{p}}_i)\right) , \\ \mathscr {L}_{\text {Jaccard}}&= 1 - \dfrac{\sum _{i=1}^{N} y_i {\hat{p}}_i +1}{\sum _{i=1}^{N} y_i+\sum _{i=1}^{N} {\hat{p}}_i-\sum _{i=1}^{N} y_i {\hat{p}}_i +1} , \end{aligned}$$

the binary label for the i-th pixel is \(y_i \in \{0,1\}\), the predicted probability of the i-th pixel corresponding to the positive class is \({\hat{p}}_i\in [0,1]\), and \(\alpha =\) 0.5 was used.
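A direct implementation of Eq. (2) in TensorFlow might look as follows; the smoothing constant of 1 follows the formulation above.

```python
import tensorflow as tf

def bce_jaccard_loss(y_true, y_pred, alpha=0.5, smooth=1.0):
    """BCE-Jaccard loss of Eq. (2) for binary flame front segmentation."""
    y_true = tf.cast(y_true, y_pred.dtype)
    bce = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(y_true, y_pred))
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection
    jaccard = 1.0 - (intersection + smooth) / (union + smooth)
    return (1.0 - alpha) * bce + alpha * jaccard
```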

Google Colab’s free graphics processing units (NVIDIA Tesla T4 GPU) were used for accelerated training. During the training process, the CNNs were fed with the training and validation data in batch sizes of 10 images, applying data augmentation as part of the pipeline. The Adam optimizer, a stochastic gradient descent method with adaptive estimation of moments, was applied. The models were trained in two steps with decreasing learning rates until convergence, while the most important metrics were monitored for the training and validation sets separately with livelossplot (0.5.5). The convergence criterion relied on the divergence of the monitored learning curves of the training and validation set, i.e. when the loss of the validation set reached a plateau while the loss of the training set still decreased. Various model states surrounding the corresponding epoch were subsequently assessed both quantitatively and visually based on the validation set. A second round of training commenced from the best-scoring model state not showing signs of overfitting, this time with a lower learning rate, to achieve a slightly improved position within the local minimum of the loss landscape. The same criteria were applied to select the final model state. Furthermore, this procedure was repeated for each model with two additional train/val/test splits to check that the final model state was representative. The number of trainable parameters, the learning rates with the corresponding numbers of training epochs, and the time requirement until convergence are shown in Table 5.

The outputs of CNNs are soft predictions, i.e. probabilities. Hence, a threshold has to be set during inference to obtain a binary mask of the predicted flame front. This threshold \(\tau _{\text {Hard}}\) can be selected (1) based on the maximum likelihood, which in the binary case corresponds to \(\tau _{\text {Hard}}\) = 0.5, (2) based on maximizing a single metric, e.g. the IoU-score, or (3) based on the relative performance of multiple metrics, e.g. with the Receiver Operating Characteristic (ROC) or Precision-Recall Curve (PRC). The thresholds of options (2) and (3) are usually determined on the validation set and shipped with the final model, in order to avoid turning \(\tau _{\text {Hard}}\) into a hyperparameter that has to be selected during inference. A sketch of option (2) is given below.
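The sketch sweeps candidate thresholds over validation predictions and keeps the one maximizing the mean IoU-score; the threshold grid is illustrative.

```python
import numpy as np

def best_iou_threshold(soft_preds, labels, taus=np.linspace(0.05, 0.95, 19)):
    """Return tau_Hard maximizing the mean IoU on the validation set."""
    def iou(pred, label):
        return np.sum(pred & label) / (np.sum(pred | label) + 1e-9)
    scores = [np.mean([iou(sp > tau, lb.astype(bool))
                       for sp, lb in zip(soft_preds, labels)])
              for tau in taus]
    return float(taus[int(np.argmax(scores))])
```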

5 Results and discussion

5.1 Model evaluation

In this section, all methods and models are evaluated on the test set, which was unseen so far. The ROC and PRC curves (curves for variable thresholds \(\tau _{\text {Hard}}\)) are presented in Fig. 7. Note that the conventional methods do not provide soft predictions and therefore only yield one point in this figure. While one of the above options (1, 2 or 3) could have been chosen to determine \(\tau _{\text {Hard}}\) for inference, we now want to highlight the best possible IoU-score for a fair comparison of peak performance between all methods and models on the test set. Similar to the parameter optimization for the conventional methods, the threshold that produces the best possible IoU-score, i.e. \(\tau _{\text {Best IoU}}\), is marked in Fig. 7, and all corresponding metrics are listed in Table 6. These states of the DL models significantly outperformed the conventional methods, among which the Otsu-Aided Gradient Method performed best: the IoU-score and Recall (=TPR) were factors of \(\approx\) 1.2 and \(\approx\) 2.5 higher, respectively, at similar FPR and \(\approx\) 1.15 times better Precision. The DL models produced results similar to the segmentation by human eye and hand, where clearly the labeling by hand is the limiting factor, though they still remained far from a perfect classifier. In Fig. 7, it can be seen that the curve of the Attention U-Net VGG16 lies above the curves of all other models in both the ROC and PRC, independent of the threshold. A comparison of the individual metrics among the DL models in Table 6 shows that almost all high scores were achieved by the Attention U-Net VGG16, with the exception of Recall, where the U-Net EfficientNet-B5 achieved the highest value.

Fig. 7 Evaluation of the model performances on the whole test set based on a variable threshold by characteristic curves. Note that the conventional methods only yield one point since no soft predictions are obtained. The locations corresponding to the best possible IoU-score are indicated with a marker. All scores corresponding to this state as well as the areas under the characteristic curves (AUROC and AUPRC) are displayed in Table 6. Results of a perfect and a random (no skill) classifier as well as of human eye and hand are indicated for reference

Table 6 Evaluation of the model performances on the test set based on a fixed threshold

Typical predictions of different models are depicted in Figs. 8 and 9 for a moderate and a lower SNR, respectively. It can be seen that the Otsu-Aided Gradient Method had some erroneous predictions in Fig. 8 and many in Fig. 9. In both figures, these cannot be avoided unless more blurring is applied and details are lost, which in turn would also affect the scores. Further, the HED produced flame front predictions of uneven thickness, stemming from its main idea: the fusion of several differently coarse and smeared layers of the CNN. Some DL models possessed better capabilities than others in properly detecting burnt and unburnt gas pockets, while some models featured more characteristic high-frequency, not necessarily connected flame front predictions. For that matter, three different U-Net variations are depicted for comparison in the lower column of both Figs. 8 and 9. The Attention U-Net, for example, showed high-frequency contours close to but not within the ground truth. This behavior is probably due to the attention gates adapting this way for noisy images. To avoid such non-physical, high-frequency erroneous predictions in general, a formulation penalizing them, or for example one containing connectivity and curvature conditions, could be incorporated into the loss function in Eq. (2). Moreover, models with more complex backbones seemed to visually retrieve a slightly higher fraction of relevant instances, specifically concerning pockets. This can also be concluded from the EfficientNet's highest Recall (=TPR) of all basic U-Net models, and the monotonic increase in Recall over the VGG16, ResNet50, and EfficientNet-B5 backbones.

Fig. 8 Characteristic predictions of different models overlaid in white on a raw image of regime #2. The SNR is 2.189

Fig. 9 Characteristic predictions of different models overlaid in white on a raw image of regime #6. The SNR is 1.833

However, too much confidence should not be placed in observations from individual visual samples of the test set, nor in comparisons between tiny differences in the scores of individual DL models. This is, firstly, due to the non-convex nature of training NNs and obtaining the optimal model state and, secondly, due to our sparse dataset, considering the statistical significance of the few samples in training/validation as well as testing. The second influence can only be eliminated in the limit of an infinite dataset. Rather, this comparison aims to point out the differences between the two classes of models, DL and conventional. All in all, two trends can be noticed. (1) The DL models performed much better in capturing small-scale structures (mainly evident from higher Recall scores) while at the same time having fewer erroneous positive predictions (supported by higher Precision scores) than the conventional methods, for which mitigating erroneous predictions is only possible with increased blurring and an associated loss of small-scale accuracy. (2) The DL models produced much more satisfying results than the conventional methods for image conditions with lower SNR. This can also be observed when the IoU-scores of, e.g., the U-Net VGG16 and the Otsu-Aided Gradient Method are evaluated on each regime separately, as shown in Table 7. All DL models performed better on the most challenging regime than the best conventional model did on the easiest regime.

Table 7 Performance of the U-Net VGG16 and the Otsu-Aided Gradient Method on each regime of the test set

5.2 Performance limits for low signal-to-noise ratios

Real combustion systems operate at much higher pressures than the conditions of the dataset utilized in this study. In addition, hand labeling might become impossible with further deteriorating image quality. Therefore, the applicability limits of DL models outside the image regimes included in the training data are extremely important. To investigate this behavior, the test set of the labeled dataset was distorted with artificial noise, and the IoU-scores were evaluated for predictions of all U-Net variations and the Otsu-Aided Gradient Method. The noise was modeled as a combination of Gaussian noise with a standard deviation in the range of \(\sigma\) = 0–23.5% of the maximum pixel value and a pixel dropout rate (pepper noise) in the range of 0–15%, which yielded SNR between 0.463 and 2.572. As before, the best possible IoU-score per data point was found by optimizing the threshold (\(\tau _{\text {Hard}}\)) or blurring (\(\sigma _{\text {Blur}}\)) parameters for the DL and conventional models, respectively, using a grid search. A sketch of the applied distortion is given below.
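The sketch combines additive Gaussian noise with a pixel dropout rate on 8-bit images scaled to [0, 255]; the exact noise pipeline used in this work may differ in detail.

```python
import numpy as np

def distort(image: np.ndarray, gauss_std: float, pepper_rate: float,
            rng: np.random.Generator) -> np.ndarray:
    """gauss_std as a fraction of the maximum pixel value (0 to 0.235),
    pepper_rate as the fraction of dropped pixels (0 to 0.15)."""
    noisy = image + rng.normal(0.0, gauss_std * 255.0, image.shape)
    drop = rng.random(image.shape) < pepper_rate   # dropped pixels set to zero
    noisy[drop] = 0.0
    return np.clip(noisy, 0, 255).astype(np.uint8)
```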

The behavior of the Otsu-Aided Gradient Method, U-Net VGG16, U-Net EfficientNet-B5 and Attention U-Net VGG16 is depicted in Fig. 10. The Attention U-Net VGG16 was the most resilient toward artificial noise distortions in terms of IoU-score. From the intersection of the two smoothing splines, it can be concluded that the Attention U-Net VGG16 performed on average better than the Otsu-Aided Gradient Method for SNR \(\gtrsim\) 0.85, with the performance difference starting to rise significantly from SNR \(\gtrsim\) 1.25. Similarly, all other DL models started to significantly outperform the Otsu-Aided Gradient Method from SNR \(\gtrsim\) 1.4 at the latest. Moreover, the Attention U-Net VGG16 could maintain better Recall values for decreasing SNR than the other DL models, but in return seemed to trade Precision, which remained very similar to that of the other DL models. Finally and most importantly, the bottom plot in Fig. 10 reveals that the U-Net VGG16 was the only model not showing an increasing trend in FPR for decreasing SNR. It appears that this model predicted the flame front class very conservatively below SNR \(\lesssim\) 1.5.

Fig. 10 Behavior of some DL models and the best performing conventional method under artificially distorted noise conditions. Smoothing splines were fitted. The regime-specific evaluations of the undistorted data points are depicted with a slightly bigger and non-transparent marker. These IoU-scores for the Otsu-Aided Gradient Method and the U-Net VGG16 correspond to the data in Table 7

Fig. 11 Some examples of artificially applied noise to an image of regime #4. The ground truth and predictions of the U-Net VGG16, U-Net EfficientNet-B5 and Attention U-Net VGG16 are depicted in the columns from left to right. The row-wise conditions are as follows: (1) SNR = 1.854, undistorted; (2) SNR = 1.673, Gaussian noise \(\sigma\) = 3.72%, pepper noise = 2.37%; (3) SNR = 1.357, Gaussian noise \(\sigma\) = 8.67%, pepper noise = 5.53%; (4) SNR = 1.204, Gaussian noise \(\sigma\) = 11.2%, pepper noise = 7.11%; (5) SNR = 0.969, Gaussian noise \(\sigma\) = 16.1%, pepper noise = 10.3%

The above discussion is confirmed by the visual examples depicted in Fig. 11. It is apparent that the U-Net EfficientNet-B5 started producing many erroneous, seemingly random predictions far away from the ground truth much earlier than the other models. Hence, it can be concluded that a model with such a complex backbone might work very well in a high to moderate SNR scenario; however, it is too overconfident for very low SNR conditions \(\lesssim\) 1.35. The Attention U-Net VGG16 continued to show the behavior of high-frequency erroneous predictions close to but not within the ground truth for moderate to low SNR, as discussed in the context of Fig. 9. This type of prediction seems to be the outcome of trading Precision for better Recall, as quantitatively expected from the discussion of Fig. 10. Compared to the U-Net EfficientNet-B5, however, it did not show signs of overconfidence in the form of seemingly random predictions. Therefore, a relation between the attention gates’ mechanisms to cope with noise and the characteristic high-frequency predictions can be concluded. All in all, the U-Net VGG16 produced the most convincing results under artificial noise distortions. Keeping the FPR low and predicting conservatively pays off: although the IoU-scores were not as high as for the Attention U-Net VGG16 and fewer total flame front predictions were made, these fewer predictions are considerably more reasonable from a human perspective, and such a model therefore provides the most added value for use under low SNR conditions. Reasonable visual performance limits were encountered around SNR \(\approx\) 1.1 (cf. last two rows of Fig. 11).

5.3 Computational effort for inference

For model inference, a prediction with the same settings and without the need for parameter tuning can be achieved by applying, e.g., the maximum likelihood as threshold \(\tau _{\text {Hard}}\) to the output of the CNN, as mentioned in Sect. 4.3. The computational effort during inference was similar for the U-Net VGG16, with 0.685 s/image, and the Otsu-Aided Gradient Method, with 0.615 s/image (regime-specific parameter tuning excluded), on a T4 GPU. Here, the DL model has the advantage that images can be fed through in different batch sizes; peak per-image times for the Attention U-Net VGG16 were achieved with a batch size of 58 images, ensuring optimal load despite the huge number of parameters of the CNN. Backbone complexity increases inference time, which should be taken into account depending on the usage goal. A minimal inference sketch is given below.
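Such parameter-free inference reduces to a single call, assuming the `model` and preprocessed `images` from the sketches in Sect. 4.

```python
# Parameter-free inference: soft predictions in batches, then binarization
# with the maximum-likelihood threshold tau_Hard = 0.5
soft_predictions = model.predict(images, batch_size=58)  # probabilities
flame_front_masks = soft_predictions > 0.5               # binary masks
```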

5.4 Some considerations concerning the training set

Note that the applicability range of a CNN is determined by the image samples it was trained on. This has some implications. An extension of the applicability limits of the DL models w.r.t. noise, as discussed in Sect. 5.2, can be achieved, e.g., by adding random artificial noise distortions during data augmentation in the training process. This significantly alters the statistical distribution of the data input given to the model. Hence, all values mentioned in Sect. 5.2 are expected to indicate upper bounds in case this tactic is applied. Moreover, the limits can even be extended to regimes where it is practically impossible to produce labels that still legitimately approximate the ground truth (due to too challenging image conditions), by augmenting training samples from regimes for which annotating data is feasible.

This principle can in general be applied for better generalization of the model. For example, to enhance the transferability of the performance to other test rigs and flame configurations, different flame types, such as swirl flames, can be included in training and validation. Additionally, the datasets can be extended with synthetic data by means of generative models, such as diffusion models or Generative Adversarial Networks (GAN), which aim to learn and reproduce the probability distribution of a given dataset, or by domain randomization, as applied, e.g., in Jose and Hampp (2024). The transferability of the evaluated models to other test rigs and flame configurations was not assessed as part of this work. Rather, we want to highlight the high degree of accuracy that can be achieved with surprisingly little data, which allows for transferability, not only to other flame types, but also to other segmentation tasks in experimental fluid dynamics and combustion, by fine-tuning these models (based on the supplementary material provided) with new domain-specific annotated data.

Another consideration is that conventional gradient-based techniques typically require the application of an edge-preserving filtering scheme to work properly, whereas DL models could additionally eliminate the need for operations such as background subtraction and laser sheet inhomogeneity correction if trained on raw images. Since CNNs are theoretically able to learn these operations, a reduction of the overall postprocessing effort could be another potential benefit.

Finally, another advantage of the DL approach compared to traditional flame front detection is its extendability to instance segmentation and multi-class pixel classification, offering, for example, the possibility to separately detect and distinguish between the main flame front, unburnt pockets, and burnt pockets.

6 Summary and conclusions

In this work, the potential of ML-based methods for the segmentation of turbulent premixed flame fronts in OH-PLIF images at elevated pressure conditions was explored. Simple DL models that can be used for this task, as well as the most important underlying principles, were introduced. Basic implementations of the highlighted supervised CNN architectures were trained on a dataset of 60 annotated images stemming from 6 different combustor operating conditions. Transfer learning and data augmentation were highlighted as efficient mitigation strategies against the disadvantages associated with sparse training data. The DL models were evaluated with benchmark segmentation metrics (IoU-score, F1-score, Precision, Recall, FPR), the Receiver Operating Characteristic and the Precision-Recall Curve. Three conventional methods based on the concept of the steepest gradient of the recorded OH-LIF intensity served as baseline implementations. Further, the influences of backbone complexity and attention gates were tested, and the performance limits for very low SNR were investigated by distorting images with artificial noise.

All supervised DL models significantly outperformed the conventional implementations. Comparing model states with the best possible IoU-scores, the IoU-scores and Recall (=TPR) values were found to be up to factors of \(\approx\) 1.2 and \(\approx\) 2.5 higher, respectively, at similar FPR and \(\approx\) 1.15 times better Precision. The results were on a performance level similar to segmentation by human eye and hand for high to moderate SNR. The DL models captured small-scale details much better with fewer erroneous positive predictions, whereas an increase in Precision for the conventional methods is associated with more low-pass filtering (blurring) during preprocessing and a loss of information in small-scale structures. This effect was more pronounced for noisy images. The DL models extended the range of image conditions w.r.t. SNR that can be reliably processed with reasonable detection performance, compared to previously existing conventional methods.

Concerning the comparison between the evaluated DL models, it was shown that multiscale fusion models such as the Holistically-Nested Edge Detector yielded flame fronts of uneven thickness and are therefore not suitable for this task. For high to moderate SNR, a more complex backbone such as an EfficientNet, as well as features like attention gates, proved to be powerful in the U-Net. For moderate to very low SNR, a simple U-Net, combined with artificial noise distortions during data augmentation in training, is the better choice, since it provided more conservative predictions by keeping the FPR low.

The presented DL-based flame front detection approach requires computational effort for inference comparable to conventional methods while eliminating the need for user-based parameter tuning. It provides significantly increased performance for single-shot flame front detection in large image datasets where individual image processing is infeasible. The results demonstrate that the use of ML can bring great benefits to the postprocessing of experimentally obtained data, not only limited to the presented case. Future work will focus on transferability, the implementation of customized loss functions, the extension to multi-class classification, as well as the application to extracting local physical quantities from single-shot images over a wide range of pressures for combustion research.