Machine learning models for the secondary Bjerknes force between two insonated bubbles

The secondary Bjerknes force plays a significant role in the evolution of bubble clusters. However, due to the complex dependence of the force on multiple parameters, it is highly non-trivial to include its effects in the simulations of bubble clusters. In this paper, machine learning is used to develop a data-driven model for the secondary Bjerknes force between two insonated bubbles as a function of the equilibrium radii of the bubbles, the distance between the bubbles, the amplitude and the center frequency of the ultrasound wave. The sign of the force may change with the phase difference between the oscillating bubbles. Meanwhile, the magnitude of the force varies over several orders of magnitude, which poses a serious challenge for the usual machine learning models. To overcome this difficulty, the magnitudes and the signs of the force are separated and modelled separately. A nonlinear regression is obtained with a feed-forward network model for the logarithm of the magnitude, whereas the sign is modelled by a support-vector machine model. The principle, the practical aspects related to the training and validation of the machine models are introduced. The predictions from the models are checked against the values computed from the Keller–Miksis equations. The results show that the models are extremely efficient while providing accurate estimate of the force. The models make it computationally feasible for the future simulations of the bubble clusters to include the effects of the secondary Bjerknes force.


Introduction
The secondary Bjerknes force [23,4] is the interaction between two bubbles oscillating in a acoustically driven fluid, and it is induced by the pressure perturbation radiated from the bubbles.The force is thought to be important in the evolution of bubble clusters and has attracted considerable research in the past decades [8,31,32,9,26,2,16,30,35,19,36], which explores the effects of nonlinear correction, multiple scattering, and the coupling with shape oscillation and translation, as well as the experimental measurement of the force.The asymmetricity of the force is discussed recently in [28] taking into account higher order nonlinear coupling between the bubbles, which further highlights the complexity of the force.
Recent experimental evidences [12,22] support the importance of the secondary Bjerknes force in the dy-namics of micro-bubble clusters.The collective behaviors of up to 100 oscillating bubbles are modelled in [15] using the coupled Keller-Miksis equations [20].It is found that the interactions between the bubbles can be both constructive and destructive, and the bifurcation sequences of a system with more bubbles can be much different from a small one.The research again demonstrates the importance of the interactions between the bubbles which are manifested as the secondary Bjerknes force.The force has been used to manipulate bubbles, e.g., as a mean to control micro-devices, which potentially have important applications [18,21,1].Given that bubble clusters are commonly observed in biomedicine, metallurgical industries, food processing, and other applications (see, e.g., [5,3,33,10]), the modelling of the secondary Bjerknes force and hence bubble clusters is a question of significant interests.
Few simulations of bubble clusters so far have employed sophisticated models for the secondary Bjerknes force.Numerical simulations conducted in [22], with a simple model for the secondary Bjerknes force, quali-tatively reproduce the experimental observations on the clustering of bubble clouds.Similar simplified models are also used in the simulations in [27,29,25], which qualitatively reproduces the formation of the Lichtenberg pattern [23].These simulations follow the movements of individual bubbles, thus are based on a Lagrangian approach.Recently a hybrid Lagrangian-Eulerian method is proposed in [24] where bubble oscillation is computed, although the secondary Bjerknes force is not explicitly included.
The past research has yielded considerable physical insights about the secondary Bjerknes force.Unfortunately, due to the complexity of the problem, the insights have yet to be translated into accurate and computationally efficient models.We observe, however, that the complexity of the problem makes it an excellent example for which a data-driven approach can be fruitful.Datadriven methods, especially machine learning, have made tremendous progresses in recent years, as are exemplified and popularized by the success of AlphaGo [34].The methods have been successfully applied to many physical and applied sciences.There is, however, not yet any report of such applications in bubble simulations.The objective of this paper is to use machine learning to build a novel model for the secondary Bjerknes force that is more comprehensive than those previously reported, and more generally, introduce this useful method into the investigation and modelling of bubbles oscillations.
The paper is organized as follows.The dynamical equations for the bubbles are reviewed in Section 2, where the dependence of the secondary Bjerknes force on relevant parameters are highlighted.The data set for the force is described in Section 3. Section 4 introduces the relevant machine learning models to be used to build the model for the force.The practical aspects of the training and testing of the models are also presented.Additional checks are performed in Section 5 where the efficiency of the models is also assessed.The conclusions are summarized in Section 6.

The governing equations
Let D be the distance between the two bubbles.The radius of bubble i (i = 1, 2) is denoted by R i (t) and its equilibrium radius is R Ei .The bubbles are driven by a uniform pressure oscillating harmonically in time: where p 0 is the ambient pressure, p a is the amplitude of the ultrasonic pressure, and ω ≡ 2π f and f are the angular and linear frequencies, respectively.By using a pressure uniform in space, it has been assumed that D is small compared with the wave length of the pressure wave or the bubbles are on the same phase plane of a planar pressure wave.The fluid has density ρ, speed of sound c, surface tension σ and kinematic viscosity ν.
The radii of the bubbles can be described by the Keller-Miksis model [20,4] with additional pressure coupling terms between the bubbles as introduced in [26].Ignoring the time-delay effect, the coupling pressure between bubbles i (i = 1, 2) and j ≡ 3 − i, denoted as p i j , is given by [26] which is valid when the radii R i and R j are much smaller than D. With p i j included, the equation for R i (t) becomes [26]: where is the pressure on the outer wall of bubble i and k is the polytropic exponent for the gas inside the bubble.We note that other models for the oscillation of coupled bubbles exist in the literature.Obviously the machine learning models to be presented below can be used with other models as well.
The secondary Bjerknes force is defined as the timeaveraged pressure exerting on bubble i due to the oscillations of bubble j [8,26].Let F i j be the notation for this force, simple calculation shows that, for small bubbles, F i j can be written as (see, e.g., [8]): where V i is the volume of bubble i.The pointed brackets represent time averaging.In the above expression we follow the tradition where F i j is positive when it is attractive.The secondary Bjerknes force factor f i j [26] is defined as In a bubble cluster, F i j is expected to depend not only on bubbles i and j but also the other bubbles.Nevertheless, when the force was considered in the few bubble cluster simulations [27,29,22] reported so far, F i j had all been calculated from 2-bubble systems, where the contributions from other bubbles were neglected.Empirical fitting of F i j as a function of D was used.The dependence of F i j on other parameters have not been considered.
For a 2-bubble system, the only secondary Bjerknes force factor is f 12 .f 12 depends on many parameters of the system, including R Ei , D, p a , ω, ν, ρ, c, σ, and k.In the present investigation, we choose water as the medium, hence fixing ν at 0.89 × 10 −6 m 2 /s, ρ at 997kg/m 3 , c at 1497m/s and σ at 0.0721N/m.An adiabatic process is assumed so that k is fixed at 1.4, whereas p 0 is assumed to be the atmospheric pressure p atm = 1.013 × 10 5 Pa.The objective of the investigation is to model the dependence of f 12 (hence F 12 ) on the five parameters: D, p a , ω (or f ), R E1 and R E2 .

The data for f 12
The machine learning method is used to discover the complicated dependence of f 12 on the system parameters.The method is data-driven and is based on a large data set for f 12 obtained over a range of values for the five parameters.The distance D ranges from 100µm to 1000µm with an increment of 100µm.The pressure amplitude p a ranges from 40kPa to 150kPa with an increment of 10kPa.This range covers both near harmonic and strongly nonlinear aharmonic oscillations.The forcing frequency f ranges from 20kHz to 40kHz with an increment of 10kHz.Both R E1 and R E2 start at 1µm and end at 10µm with an increment of 2µm.
When the other parameters are held fixed, a bubble pair with radii (R E1 , R E2 ) would have the same f 12 as a pair with radii (R E2 , R E1 ).Therefore the parameter combinations with R E1 < R E2 are removed from the data set, which leaves in total 5400 combinations of parameter values.Eq. 3 is numerically integrated for each combination to obtain the corresponding f 12 .The ode45 solver in MATLAB is used.In each run, the simulation is run for ten periods of the forcing pressure to allow the oscillation becomes stationary.The data from the last two periods are used to calculate the force factor f 12 according to Eq. 6.The data for f 12 obtained this way, and the corresponding parameters, form the dataset for the development of the machine learning models.In the terminology of machine learning, a set of values for the five parameters is called a predictor, the corresponding f 12 is a response.

The machine learning models
The secondary Bjerknes force factor f 12 depends sensitively on the flow parameters.As a result, the magnitude for f 12 varies over many orders of magnitude, which poses a significant difficulty for the development of the machine learning models.
To overcome the difficulty, the data set for f 12 is split into two.The first one contains log 10 | f 12 |, whereas the second one contains the sign of f 12 (sgn f 12 ).Two machine learning models are built for the two sets separately, and the prediction for f 12 is reconstructed from the two models.The first model, given the nature of the data, is a regression model, which is implemented with a feed-forward neural network (FFNN).The data in the second data set are binary (they are either 1 or −1).A classification model, the support-vector machine (SVM), is thus used.If the predictions from the first and the second models are y 1 and y 2 , respectively, the prediction for f 12 is then given by y 2 10 y 1 .
Working with log 10 | f 12 | proves to be crucial.Taking the logarithm reduces the range of the data, and as a result, a FFNN can be found to model | f 12 | (after exponentiation) and hence f 12 with good accuracy.Without separating the magnitude and the sign and taking the logarithm of the magnitude, we failed to find a satisfactory ML model for f 12 .The FFNN and the SVM models are now explained.An FFNN [14] typically includes an input layer, an out put layer, and several hidden layers of neurons.The neurons are connected to and receive inputs from those in previous layers, and similarly, connected to and send outputs to those in later layers.Each neuron is defined by an activation function (also known as transfer function), which processes the inputs and generates an output.The inputs are combined with suitable weights w and biases b in the activation function.Fig. 1 provides a schematic illustration of the structure of an FFNN.All the components in a blue box forms a neuron.Only one neuron is shown explicitly in each layer in Fig. 1 but in reality there could be many.In an FFNN, the number of hidden layers, the number of neurons in each layer, and the activation functions are chosen a priori and, in practice, mostly empirically.The weights w and biases b are determined by 'training' the network to provide the optimal description of the data in the so-called 'training' dataset.The optimal weights and biases are usually obtained by applying optimization algorithms which adjust the weights and biases iteratively through a process called back-propagation.For more details on FFNNs and machine learning in general, see, e.g., [14,17,13].

The architecture and the hyperparameters
MATLAB is used to define, train, validate and test the network [6].The numbers of layers and neurons and the activation functions are the hyperparameters that should be decided at the outset.
For the activation functions, the default setting is adopted, where the hyperbolic tangent sigmoid function is used for the neurons in the hidden layers and the linear function is used in the output layer.Even though in the  machine learning community rectified linear units (Re-LUs) are now recommended for large scale problems, a few tests using the ReLUs for the hidden layers do not show appreciable differences.
As for the numbers of the layers and neurons, it is known that perfect regression can be obtained for any dataset if there is no limit to the number of available neurons.However, the training may become too expensive and the model too inefficient if too many neurons are used.Therefore it is desirable to use as few neurons as possible.Empirical evidences show that, in some cases, the model performance can be improved by using more hidden layers [13].However, there is not yet theoretical justification or guidelines for the optimal choices.As a result, we have adopted the following trial-and-error strategy to decide these two hyperparameters, which is explained briefly here while the numerical evidences is presented later.We start with an FFNN with only one hidden layer, and train it with increasing number of neurons, until satisfactory performance is obtained.After a number of tests, it is found that consistently good results can be obtained with 30 neurons.The total number of neurons is thus fixed at 30, and networks with different numbers of hidden layers are tested to explore how the performance can be further improved.As demonstrated below with numerical results, the best performance is found with two hidden layers and 15 neurons on each.This architecture is thus chosen, which is illustrated in Fig. 1.More details are given in Section 4.1.3.

The training of the FFNN
With the architecture chosen, the weights and the biases in the neurons are then initialized randomly, but they have to be adjusted to improve the performance of the model.This process is called training, in which the dataset for log 10 | f 12 | mentioned in Section 3 is used.For each predictor (i.e., a parameter combination), the FFNN is used to make a prediction for log 10 | f 12 |, which is the response in this model.The prediction is compared with the true value obtained as explained in Section 3. The mean squared error (MSE) between the true and predicted values is used as the performance measure for the model.The Levenberg-Marquardt algorithm [14] is  used to optimize the weights and biases iteratively.In the machine learning terminology, an iteration which scans through all training data is called an epoch.
The training is stopped when a certain stopping condition is satisfied.One of these conditions is that the magnitude of the gradient of the MSE should be sufficiently small.However, in our tests, the training is always terminated due to detection of overfitting.Overfitting is a phenomenon where the model performs well in training, but makes poor predictions for data outside of the training dataset.It is a common problem to be avoided when training a neural network.In this investigation, crossvalidation is used to tackle this problem [17].Specifically, 70% of the dataset for log 10 | f 12 | is randomly chosen to form the training set, while 15% is used for validation, and the rest for testing.In each epoch, the current FFNN is used to make predictions on the validation data set.The MSE of the predictions over the validation set is monitored.If the MSE increases for N o = 6 consecutive epochs (while the MSE for the training set is still decreasing), then overfitting is deemed to have happened and the training is stopped.The value 6 is empirical.If a different N o is used, one might need to adjust other parameters to obtain a model with similar performance.

Numerical results
The architectures of the FFNN models tested in this investigation are summarized in Table 1.For each architecture, the training is run 32 times with random initial- ization.32 models are thus produced, which are slightly different from each other even if they have the same architecture.The 32 optimal validation MSEs for these models obtained from the training are calculated.The median values are plotted with the symbols in Fig. 2 for all 8 architectures, as well as the maximum and minimum values, which are given by the error bars.The result for architecture 1 shows that, with 30 neurons, the validation MSE can be kept under 5% in all runs.This observation has been the basis to choose 30 as the total number of the neurons.Fig. 2 shows clearly that the MSE can be reduced by using multiple hidden layers.Architecture 4, which has two hidden layers with 15 neurons in each, displays the best performance.Further increasing the number of hidden layers appears to somewhat degrade the performance.Based on these results, architecture 4 has been chosen to obtain all the results to be presented below.As an illustration, Fig. 3 shows how the performance and the gradient of the model improves by the training.There are a small number of extraneous samples where ε r can be more than 100%, but, as expected, the majority of the samples have small errors.The cumulative probability distribution, plotted with the dashed line, shows that more than 90% samples in the testing set have relative errors smaller than 30%, and about 85% smaller than 20%.Comparing the errors on the training set and the testing set, it is only slightly more probable to observe larger errors on the testing data, which demonstrates that the model generalizes well.
The above results show that the FFNN model (with 15 neurons on each of the two hidden layers) can provide an accurate model for not only log 10 | f 12 | but also | f 12 |.In terms of the training cost, typically less than 100 epochs are needed to achieve the accuracy depicted in Figs.4-6, which takes less than one minute to compute on a modern laptop.

The support-vector machine model for sgn f 12
In its most simple form, the SVM [17] is an algorithm that finds the optimal line that separates two clusters of points on a plane, hence classifying the points into two classes.In order to introduce some necessary basic concepts, its formulation is briefly explained.It is assumed that a set of N points x j = (x j 1 , x j 2 ) ∈ R 2 ( j = 1, 2, ..., N) is given as the training set.The points are divided into two groups, the positive and negative classes.For simplicity, the data are assumed to be separable, i.e., it is assumed that a straight line can be found to separate the two classes, and that the classifications are known and labelled by 1 and −1, respectively.The classification of point x j is recorded by y j ∈ {−1, 1}.
The optimal separating line is defined as the line that separates the two whilst leaving the largest margins on both sides of the line.Let the equation for the line be f (x) ≡ w T x + b = 0, where w ∈ R 2 and b ∈ R. It can be shown that the best separating line is the solution of the following optimization problem [17]: find w and b that minimize the objective ||w|| 2 such that for all data points (x j , y j ), The optimal objective maximizes the margins on the two sides of the line.The constraints ensure that the points are found on the correct sides of the separating line.
The optimal solution is used to classify new data (not in the training set) in the following way.Let ŵ and b be the optimal solution, and f (x) ≡ ŵT x + b.The classification of a data point x is given by its label sgn f (x).
In the more general cases where the data are not separable, two modifications to the above method have been introduced.Firstly, one may allow a small number of points to be mis-classified, a practice termed using soft margin.Mathematically, this method amounts to replacing Eq. 7 by for ξ j ≥ 0. Meanwhile, a penalty term is added to the objective function to limit the magnitude of ξ j .For this investigation, the term takes the form of C N j=1 ξ j with C being a parameter called the box constraint.A larger C introduces larger penalty, hence reduces ξ j 's magnitudes, which means fewer mis-classifications are allowed.No mis-classification is allowed when C → ∞.
Secondly, one may use a nonlinear curve to separate the data.The idea is implemented by using a function f (x) = w T h(x) + b as the separating curve, where h(x) is a non-linear function that transforms the separating line to a nonlinear separating curve.
The SVM model obviously is also applicable for higher dimensional problems.It is used in this investigation, but the solution is based on its dual formulation, because the dual problem is convex and guaranteed to converge to the global minimum [7].Letting α j be the dual variable corresponding to the constraint given in Eq. 8, the dual problem can be written as min 1 2 subject to the constraints The bivariate function G in Eq. 9 is the kernel function corresponding to the nonlinear function h(x).For this investigation, a Gaussian kernel is used where G(x j , x k ) = exp(−||x j − x k || 2 /σ 2 ) with σ being the kernel scale.
The dual problem is solved with the sequential minimal optimization (SMO) algorithm.The SMO is an iterative descent algorithm.The convergence is monitored by the gradient of the objective function (given in Eq. 9) with respect to α j , specifically, by the difference between the gradient components corresponding to the maximal upper and lower violation of the feasibility conditions [7,11].The iteration is deemed converged when this difference is smaller than a tolerance δ.Once the optimal solution for the dual problem is found, ŵ and b can then be found from α j using the Karush-Kuhn-Tucker conditions, which are then used to classify a new data point.For more details of the SVM methods and the algorithms, see [17,7].The SVM model is applied to classify the data for sgn f 12 , using the MATLAB command fitcsvm which implements the above algorithms.In this case, x is a five dimensional vector, i.e., x = (D, R E1 , R E2 , p a , ω) T and the label y is sgn f 12 .The number of data points is N = 5400.The data are standardized when they are fed into the training algorithm.Specifically, the mean is removed from the data which are then rescaled by the standard deviation.Fig. 7 illustrates the decay of the gradient difference as well as the objective function in Eq. 9 in a typical training process.
The adjustable parameters in the model are the box constraint C, the kernel scale σ, and the tolerance δ.Tests with δ = 10 −3 and 10 −4 show essentially the same results.Therefore δ = 10 −4 has been used.When the separation boundary has a complicated shape, a small σ is required to model the fine features of the boundary.However a too small σ may limit the ability of the model to capture the large scale features in the distribution of the data.The choice of C depends on the accuracy of the data for f 12 .If the data for f 12 likely contain large errors, it is not meaningful to insist perfect classification.The physical model being used here (Eq. 3) has been derived with a few simplifying assumptions.Therefore, it is appropriate to use a large C to limit misclassification, although it is not necessary to achieve zero mis-classification.
With the above discussion in mind, the misclassification rate has been calculated for several different kernel scales σ and box constraints C. A point x j is mis-classified if y j f (x j ) < 1 (c.f.Eq. 8).The mis-classification rate is the ratio of the number of misclassified points, N m , to the total number N. The results are plotted in Fig. 8.The mis-classification rate reaches an approximate plateau quickly when C increases.The results for σ = 0.4 are clearly much worse, whereas small mis-classification is found for other σ's when C is sufficiently large.These observations demonstrate the robustness of the classification scheme as long as σ is not too small.The smallest mis-classification of 2.61% is found at (σ, C) = (0.6, 20), which is considered sufficiently small.Therefore C = 20 and σ = 0.6 are used in what follows.The robustness of the model is assessed in Fig. 9 using k-fold cross-validation [17].In k-fold cross-validation, the dataset is divided randomly into k equal sets, and k models (called the partitioned models) are trained.The ith (i = 1, 2, ..., k) dataset is called the ith test fold, whereas the other k−1 sets form the ith training fold.The ith partitioned SVM model is trained on the ith training fold and evaluated on the ith test fold.The overall assessment is based on performance indices averaged over the k partitioned models.
The most commonly used performance indices are the accuracy, the precision, the recall and the specificity.For each predictor, the model response could either be positive or negative; in either case, it could be either true or false.As a result, the response falls in one of four categories: a positive response could be a true positive (TP) or, coming erroneously from a predictor in the negative class, a false positive (FN), whereas a negative response could be true negative (TN) or false negative (FN).Let N T P , N T N , N FP and N FN be the numbers of TP, TN, FP, and FN, respectively.The accuracy is defined as which simply gives the percentage of correct predictions in both classes.The precision, recall and specificity are, respectively, defined as The precision tells us the probability of a positive response being correct; the recall is the probability of the positives being correctly identified as positive, while the specificity gives the probability of the negatives being correctly identified as negative [17].
Due to the randomness in data partition, the indices averaged over the k partitioned models may fluctuate if the cross-validation is conducted multiple times.Therefore, we repeat the validation 32 times to find the medians and ranges of the indices.The results are plotted in Fig. 9.It is clear that the accuracy, the precision, and the recall of the models are consistently high (more than 98% for all of them), although they drop slightly with the number of folds while the ranges increase slightly (the error bars for the accuracy are too narrow to see on the figure).With fewer folds, the number of data points in each fold, hence in the training set, is smaller.Thus the performance of the partitioned models are expected to somewhat deteriorate.The results for accuracy show that more than 98% data points, with either negative or positive f 12 , are classified correctly.Meanwhile, the result for the precision shows that there is a 98% chance that f 12 is indeed positive when the model predicts so, and the result for the recall shows that there is a 1 − 98% = 2% chance that f 12 is not predicted to be positive when it is actually positive.
Fig. 9 shows that the median specificity and its range display behaviours similar to those of the accuracy, the precision and the recall, but there is stronger dependence on the number of folds, and its range is wider and the median is smaller.The median recall increases with the fold number and reaches approximately 85%, which implies there is a 1 − 85% = 15% chance that f 12 is not predicted to be negative when it is actually negative.The specificity is relatively low compared with the other indices, but this observation does not necessarily reflect poorer performance regarding the samples with negative f 12 .Rather, this behaviour is due to the fact that there is only about 3.4% data points on which f 12 < 0, thus mis-classification has an outsized impact.
The results presented in this subsection show that, with kernel scale σ = 0.6, box constraint C = 20, and tolerance δ = 10 −4 , the trained SVM classifier can effectively model the distribution of the signs of f 12 .

Figure 1 :
Figure 1: The architecture of the FFNN with two hidden layers having 15 neurons on each.

Figure 2 :Figure 3 :
Figure 2: The median and range of the optimal MSE obtained with different FFNN models in 32 runs.

Figure 4 :Figure 6 :
Figure 4: The regression of the training (filled circles) and the testing (empty circles) data.

Figure 7 :
Figure 7: The objective function and the gradient difference as functions of the training epochs in a typical training session.Only the first 100 epochs are shown.

Figure 8 :
Figure 8: The mis-classification rate for different kernel scales σ and box constraints C.

Figure 9 :
Figure9: The left axis: the median accuracy (solid line with circles) and the median specificity (dashed line with squares).The right axis: the median precision (solid line with diamonds) and the median recall (dashed line with triangles).Found in an ensemble of 32 runs.The error bars show the maximum and minimum.

Figure 10 :
Figure 10: The scatter plot for ( f t 12 / f t 12 , f m 12 / f m 12 ), where • denotes the averaging over the dataset.

Table 1 :
The number of neurons (N n ) in each layer for each network architecture.