Introduction

In the field of high-speed machining with processes like milling, turning, and drilling, simultaneous achievement of the performance targets of part accuracy, high surface finish and productivity is often constrained by the occurrence of dynamic instability of the machining process, referred to as chatter. Improvements in process control due to new machining technologies have typically improved one or more of these metrics, but often at the expense of deterioration in the other(s) reflecting a trade-off between process control and productivity. Over the last 50 years, many new technologies such as computer numerically controlled (CNC) machining have allowed greater process control while increasing the capability and flexibility of the process itself. Computer-aided design/computer-aided manufacturing (CAD/CAM) software has enabled digital representation of part geometries and the subsequent generation of computer instructions to command the machine motions required to remove material and leave the desired part geometry. While these technologies have increased the spindle speed range due to new spindle designs, as well as improved tool materials and coatings, the machining parameters are still typically selected based on manufacturer recommendations or, in most cases, operator experience. In practice, operators typically adjust spindle speeds downwards when encountering chatter, reducing productivity. However, in many cases increasing spindle speeds can return stability to the process and improve productivity. Knowledge of the true, as opposed to physics-based or manufacturer-provided, stability model would enable operators to better select initial machining parameters to both maintain dynamic stability and improve part quality without compromising productivity.

Knowledge of the true operational stability model would also enable operators—and ultimately the machines themselves—to make in-process optimum parameter adjustments during production. As machines become more intelligent, self-knowledge of the true stability model is the foundation for self-aware operations. Self-aware machines, as their name suggests, exhibit self-awareness of both their operational health and machining status and, as required, self-control to perform parametric adjustments that maintain continued performance to target levels. For example, self-aware machines can make parametric adjustments to keep themselves operational while waiting for service; can adjust their machining parameters to assure process stability during machining; and can adjust their load to balance production yields in their cell in the event of excess demand or machine downtime. This level of machine intelligence has considerable potential to enhance productivity in the manufacturing environment and to maintain optimal operational performance for maximum efficiency. New sensor technologies and machine learning (ML) methods are building the foundations for self-awareness that would enable the machining process to identify those combinations of spindle speed and axial depth-of-cut needed to maintain dynamic stability of the process and, in the case of approaching chatter, identify the adjustments in these parameters that allow the process to quickly return to stability.

This research builds on a recent trajectory of research in machine learning referred to as physics-informed or physics-guided machine learning (PGML). To implement the PGML approach, a machine learning model is trained with stability data simulated using a physics-based model built on the well-known receptance coupling substructure analysis (RCSA), a mechanistic force model, and Fourier analysis to generate the stability lobe diagram (SLD). The SLD defines regions of stability and instability as a function of spindle speed and depth of cut, described in more detail later. This pre-trained model is then updated with experimental measured data, along with domain knowledge, to enable the convergence of the physics-based model to the true operational model. Physics-based theory also informs the updating strategies by guiding the sampling of measurement points prior to retraining the PGML. A major contribution of this paper is the exploration of measurement strategies that increase predictive accuracy of the PGML model while minimizing measurement cost. Experiments are implemented within a simulation framework that provides a common benchmark for evaluating the updating strategies. The research goal is to develop guidelines for sampling that promote fast and inexpensive convergence of the physics-based model to the true operational stability model suitable for use within a production environment.

This paper proceeds as follows. Sect. 2 introduces PGML and reviews recent literature in this area. Sect. 3 provides a perspective on the challenge of maintaining dynamic stability during machining motivating this work. Sect. 4 describes the development of the physics-based dynamic stability model and discusses model uncertainties that motivate the PGML approach. Sect. 5 introduces the PGML approach implemented in this research. Sect. 6 presents the research framework including machine learning methods and measurement updating strategies. Sect. 7 describes the methods for both training and updating the PGML model with simulated measurement data, domain knowledge and physical theory. Numerical results are presented in Sect. 8. Implications of this research for implementation in a production environment and as a step towards self-aware machining are discussed in Sect. 9 followed by conclusions in Sect. 10.

Physics-guided machine learning

When modeling complex machining processes, the practical choice—and associated best practice—has been to choose between physics-based or data-driven models for prediction. Both approaches have distinct advantages when applied to complex systems with integrated mechanical, electrical, and software components. However, individually, they fall short of delivering the broad capability and accuracy needed for chatter-free high precision machining. Informed machine learning is part of an emerging trend to embed domain knowledge, broadly defined, into ML models. This approach use physical principles to inform and guide the search for the best data-driven model, thereby capturing the best attributes of both physics-based and data-learning models, as shown in Fig. 1 (Greis et al., 2020).

Given the diverse scientific and other disciplines in which these techniques are being applied, this approach has been referred to by several names such as physics-guided ML, physics-informed ML, physics-based ML, physics-aware ML—or simply informed ML. A number of review papers have appeared in the last several years that offer various frameworks for organizing the theory and practice of informed machine learning (Kim et al., 2021). Willard et al. (2020) offer a taxonomy of physics-based modeling with machine learning techniques by application area and class of methodology. von Rueden et al. (2021) offer a taxonomy that considers the source of the knowledge (i.e., natural science, expert knowledge, or accepted physical law), its representation (i.e., algebraic equations, simulation results, or logical rules), and how it is integrated into the machine learning pipeline. Roscher et al. (2020) approach informed machine learning from the perspective of its ability to extract and explain novel scientific results. More recently, Karniadakis et al. (2021) offer a broad review of prevailing trends in informed machine learning, including capabilities and limitations. In addition, a growing body of research on physics informed neural networks (PINNs) focus on integration of physical knowledge that can be described as partial derivative equations directly in the neural network structures or their loss functions (Cuomo et al., 2022; Raissi et al., 2017a, b, 2019). As is evident from the current literature, a definitive nomenclature has not yet emerged. This research implements three approaches to incorporate physics-based knowledge into the model. A physics-based stability model generates the simulation data that determine the baseline machine learning model. Domain knowledge that can be considered to be expert knowledge captured in logical rules informs the determination of non-measured stability value during updating. And mathematical expressions of the stability model guide the measurement sampling process during updating. Given these different approaches, the overall method proposed herein is referred as PGML, recognizing that the naming choice is not straightforward given research streams in the current literature.

Fig. 1
figure 1

Physics-Guided Data Learning

Data-driven approaches are built on large sets of historical data and can learn directly from real-time sensor data (e.g. vibration, temperature, acoustic emissions, etc.) collected during machining. Advantages include the ability to model highly complex physical systems for which there is no underlying physical model that completely defines the system, or where the relationships between the input and output variables are difficult to describe using simple mathematics, or when the ability to include contextual data (e.g. environmental conditions, changes in operating regime, etc.) is important. A challenge with black box data-driven models is that they are agnostic to physical law because they rely only on data and not theory. They are, also, therefore dependent on the available data for training which can lead to relationships that do not generalize beyond the range of the training dataset. Since data-driven model predictions are generally limited to what the training set has seen, these models are not always useful for generating new scientific knowledge. Physics-based models are still preferred for scientific discovery. However, PGML models have been shown to be capable of revealing new knowledge. These insights can be attributed in part to the inclusion of domain knowledge of the underlying physics explicitly into the PGML modeling process.

The application of machine learning to stability prediction in milling, in particular, has been an area of keen and accelerating interest (Sharp et al., 2018; Oleaga et al., 2018; Cherukuri et al., 2019; Tao et al., 2018) and several excellent review papers have been published (Kim et al., 2018; Wang et al., 2018). In addition, Karandikar et al. (2020) describe a novel Bayesian learning approach for stability boundary and optimal parameter identification in milling without the knowledge of the underlying tool dynamics or material cutting force coefficients using a limited number of data points. Cornelius et al. (2021) introduce a Bayesian framework for identifying the milling stability boundary and system parameters for reverse parameter identification through iterative testing. Several recent papers incorporate data collected during simulation and real-time operations to the creation and adaptation of stability models. Friedrich et al. (2017; 2018) implement machine learning to predict chatter using scenario-specific milling data obtained from a simulation model, and then extend that model to allow for continuous learning and time-variant systems using measured acceleration signals. Saadallah et al. (2018) implement a machine learning framework based on results of a geometric physically-based simulation with varied process parameter values and refined using an active learning approach.

PGML offers the next advance by formally incorporating physical theory and domain knowledge of the underlying physics into the machine learning pipeline. Early work in PGML focused on non-manufacturing applications characterized by degrees of complexity that resist capture by traditional physical models, but for which large amounts of data are available for machine learning (Karpatne et al., 2017). These include physical phenomena such as turbulent flow (Singh et al., 2017) and geoscience applications such as hydrologic modeling and climate change (Karpatne et al., 2017; Sheikh & Jahirabadkar, 2018; Faghmous & Kumar, 2014). Areas of application have expanded recently into other domains including structural dynamics (Yu et al., 2020), power system management (Wang et al., 2020a), and modeling of seismic events (Zhang et al., 2020).

PGML is a relatively newly explored methodology in machining. Several recent applications can be noted. Wang et al. (2020b) construct a physics-guided gated recurrent units (GRU) machine learning model for continuous prediction of tool wear that considers complex tool cutting conditions and dynamic changes of physical parameters experienced in practice. Lee et al., (2018) introduce a physics-based artificial neural network for online monitoring of steady-state tool temperatures at the tool/chip interface. Lu et al. (2017) model the electrochemical micro-machining process by embedding knowledge about the relationship between the input process parameters and intermediate outputs, as well as domain knowledge about the mechanisms of the micro-machining process, into a neural network structure.

Most recently, physics-informed or physic-guided approaches for stability modeling during machining operations have been implemented within a transfer learning environment that allows for the effects of the changing process dynamics and highly complex cutting operations. Postel et al. (2020) utilize deep neural networks pre-trained with simulated data to match network predictions with experimentally observed stability states acquired under different cutting conditions using ensemble transfer learning to combine predictions. Similarly, Unver & Sener (2021) combine analytical solutions and convolutional neural networks within a transfer learning framework. The primary classifier AlexNet is pre-trained on analytically developed and labeled stability solutions before being fed experimental vibration data collected during milling. Yesilli et al. (2020) use wavelet packet transform and ensemble empirical mode decomposition, two well-known chatter detection algorithms, to not only classify acceleration signals associated with chatter, but also to transfer knowledge from one cutting configuration to another.

Challenges of dynamic stability modeling for chatter detection

Dynamic stability of cutting processes such as milling is a function of the dynamic behavior of the tool and the workpiece during cutting. CAD/CAM software generally treats machining as a geometric effort. As long as the cylindrical tool follows the required path through the prismatic work material imparting the desired geometry, it is assumed that the machining process is acceptable. This approach does not consider constraints imposed by machining dynamics. For example, some combinations of spindle speed and axial depth-of-cut will exhibit self-excited vibrations, or chatter, which produce large forces, vibrations, unacceptable surface finish, and potential damage to the tool, part, and spindle. Additionally, even if stable behavior is obtained, the geometric accuracy of the machined part may or may not satisfy design tolerances, again depending on the selected spindle speed and axial depth-of-cut combination. Stability models that capture the true machining dynamics on the shop floor are desired to select spindle speed and axial depth-of-cut combinations that avoid chatter, while providing the required geometric accuracy.

Combinations of spindle speed and axial depth-of-cut that result in a stable machining process are represented by a stability lobe diagram (SLD) which defines stable and unstable stability regimes as a function of these two machining parameters. SLDs can be computed analytically or experimentally. In order to determine the SLD experimentally, a large number of combinations of spindle speed and axial depth-of-cut must be tested for chatter which is time-intensive, as well as materially and computationally expensive. Analytical and physics-based models rely on theory derived through experiment and expressed notationally. A limitation of stability models and SLDs derived analytically is that they rarely exactly mirror the SLDs that govern operations on the factory floor. The combinations of spindle speed and axial depth-of-cut that yield stable dynamics in an operational environment depend not only on model uncertainties, but also the dynamics of the particular milling machine, conditions in the ambient operational environment, and also conditions such as tool wear.

Further, an SLD calculated once is only true for that machine working on that workpiece and in that environment, and only for that time and not for the whole lifetime of that machine. To analyze the underlying, and operational, stability model, measurements must be repeated periodically as operating or machine conditions change over time. While a manufacturer may provide operational guidelines for initial parameter setting, they are not specific to the particular production environment, the age of the machine, or the acquired skill of the particular operator. Thus, analytical models fail to meet the essential criterion for self-awareness—a stability model and associated SLD that have been calibrated to the production environment and that can provide feedback to an operator for manual adjustments in real time to regain stability at the onset of chatter and ultimately to the machine itself for automatic self-adjustment.

The lack of alignment of the physics-based model and true SLD due to its time-varying nature present both computational and operational challenges for chatter detection in a production environment. For setting initial cutting parameters, a method is required that can align the physics-based or manufacturer-supplied SLD with the true SLD so the operator can select initial parameters. As the machine performs over time, an automated method is required to update the physics-based SLD as conditions change. Thus, the model must be capable of handling time-varying conditions. The method also has to efficiently accommodate continuously collected training data since all the measured data may not be collected by experiment at one time, but periodically during production. As manufacturers move toward self-aware machining, this data may be automatically collected, chatter detected, and the SLD adjusted autonomously, so that the machine itself is able to self-adjust cutting parameters to maintain stability.

Development of the physics-based dynamic stability model

The PGML approach implemented here begins with the development of a physics-based model for milling. Milling is a machining process that uses a rotating cylindrical cutting tool to remove material from the surface of a workpiece. During the milling process, the cylindrical tool follows a predetermined tool path to achieve the desired geometry of the workpiece. As discussed above, under certain dynamic conditions, the milling process will exhibit chatter—a self-excited vibrational state that leads to instability and uncontrollability of the system. Development of physics-based models to predict chatter during milling are based on an understanding of the vibrational behavior of the tool-holder-spindle-machine assembly (and sometimes the part) which imposes specific measurement requirements for model-building. First, the vibrational behavior at the tool tip is traditionally described by the frequency response function (FRF) which is obtained through modal testing. A popular approach is to excite the structure in question using an instrumented hammer and to use a low mass accelerometer attached to the structure to record the subsequent time-domain vibration response. The frequency domain displacement-to-force ratio is the FRF, or receptance, for the tool tip (or part).

While the measurement procedure is well-understood, the lack of widespread availability of modal testing equipment and associated expertise has hindered the implementation of machining modeling on the shop floor. This points to a second need; the tool tip receptances must be identified without physical measurement of each. Because the tool is clamped in a holder that is inserted in a spindle attached to the machine, tool tip receptance prediction is not trivial. Additionally, the tool tip receptance and machining models are deterministic, but include inherent uncertainties. The predicted machining parameters are, therefore, also uncertain. This establishes the need for uncertainty reduction through testing. To build the physics-based dynamic stability model and its SLD, three physics-based models are used. First, receptance coupling substructure analysis (RCSA) is used to predict the tool tip receptance. Second, a mechanistic force model is used to relate the cutting force to the commanded chip area through cutting force coefficients. Third, a mean force frequency domain analysis is used to predict the stability limit using the first two models as input. These models are briefly described in the following sections.

Receptance coupling substructure modeling

Using three-component RCSA for tool tip dynamics prediction has been previously demonstrated (Schmitz & Donaldson, 2000; Schmitz & Duncan, 2005; Schmitz & Smith, 2009). In prior efforts, the free-free boundary condition tool and holder were modeled as cylindrical cross-section Timoshenko beams. These beam receptances were coupled analytically to measured receptances of the spindle-machine. The sequence of steps for tool point receptance prediction are: (1) calculate the tool receptances (free-free boundary conditions) using the Timoshenko beam model; (2) calculate the holder receptances (free-free boundary conditions) using the Timoshenko beam model; (3) measure the spindle-machine receptances using impact testing; and (4) couple these receptances to predict the tool-holder-spindle-machine assembly dynamics using either rigid or flexible-damped compatibility conditions. The approach is summarized in Fig. 2.

Fig. 2
figure 2

Summary of RCSA Procedure

The coupling procedure is described by Fig. 3 and Eqs. (1)-(3) below. Figure 3 shows that both linear and translational coordinates are required. This leads to four receptances that relate displacement, x, and rotation, θ, to force, f, and moment, m.

Fig. 3
figure 3

RCSA Coordinates

The four direct receptances are shown in Eq. 1 for the right end of the cylinder (tool) pictured in Fig. 3. The four component receptances are organized into a generalized coordinate format in Eq. 2. Using this formulation, the direct tool point receptances (uppercase coordinates) are computed for a rigid connection using the component direct and cross receptances (lowercase coordinates) in Eq. 3. Flexible/damped connections may also be selected. In Eq. 3, the displacement-to-force H11 receptance provides the input to machining simulations.

$$\begin{array}{cc}{h}_{11}=\frac{{x}_{1}}{{f}_{1}}& {l}_{11}=\frac{{x}_{1}}{{m}_{1}}\\ {n}_{11}=\frac{{\theta }_{1}}{{f}_{1}}& {p}_{11}=\frac{{\theta }_{1}}{{m}_{1}}\end{array}$$
(1)
$$\left\{\begin{array}{c}{x}_{1}\\ {\theta }_{1}\end{array}\right\}=\left[\begin{array}{cc}{h}_{11}& {l}_{11}\\ {n}_{11}& {p}_{11}\end{array}\right]\left\{\begin{array}{c}{f}_{1}\\ {m}_{1}\end{array}\right\}, \left\{{u}_{1}\right\}=\left[{R}_{11}\right]\left\{{q}_{1}\right\}$$
(2)
$${G}_{11}={R}_{11}-{R}_{12a}{\left({R}_{2a2a}+{R}_{2b2b}\right)}^{-1}{R}_{2a1}=\left[\begin{array}{cc}{H}_{11}& {L}_{11}\\ {N}_{11}& {P}_{11}\end{array}\right]$$
(3)

Force model

In mechanistic force modeling for milling, the cutting force coefficients, k, are calculated using the commanded axial depth-of-cut, b, and chip thickness, h. The model described in Eq. 4 includes force components that are tangential, t, and normal, n, to the rotating cutting edge. Force coefficients that relate the force to chip area, bh, are identified by a c subscript (cutting or shearing force). Those that relate the force to axial depth alone have an e subscript (edge or rubbing force).

$$\begin{array}{c}{F}_{t}={k}_{tc}bh+{k}_{te}b\\ {F}_{n}={k}_{nc}bh+{k}_{ne}b\end{array}$$
(4)

These coefficients may be determined by experiment where the cutting force is measured using a force dynamometer and the commanded axial depth-of-cut and chip thickness are known. Linear regression over a range of chip thickness values and nonlinear least squares fitting to the time-domain force have been applied (Rubeo & Schmitz, 2016). As an alternative, the material behavior can be defined using a constitutive model and the cutting force predicted using finite element simulation (Shi & Liu, 2004).

Frequency domain stability analysis

The analytical stability limit may then be determined using Fourier force analysis to transform the dynamic milling equations into a time-invariant, but radial, immersion dependent system (Altintas & Budak, 1995). This analysis expands the frequency domain dynamic milling equations into a Fourier series. The series is then truncated to include only the mean component. The tool point receptance and force model are inputs to the analysis. The dynamic milling process model is derived by considering the Fourier series expansion of the time-varying milling force coefficients. The eigenvalues of the dynamic milling expression are calculated analytically by selecting a chatter frequency around the dominant structural modes. Noting that the axial depth-of-cut is always a real quantity, the chatter free axial depths-of-cut and spindle speeds are analytically formulated as a function of the tool tip receptances, the cutting force coefficients, the number of teeth, milling orientation (up and down), and radial depth-of-cut.

The individual stability lobes that determine the SLD are computed by, first, selecting a chatter frequency near a dominant mode from the tool tip receptances and solving the eigenvalue equation. The critical axial depth-of-cut and spindle speed are calculated for each stability lobe. This procedure is repeated by scanning the chatter frequencies around all dominant modes from the tool tip receptance. The final output is the limiting axial depth-of-cut as a function of spindle speed for a selected radial depth-of-cut, milling orientation, and number of teeth on the endmill. The graphical representation of this output is the SLD that defines the stability boundary that is the subject of this research.

In addition to the operational and model uncertainties noted earlier, the deterministic models described in the previous section also include measurement uncertainty. For example, the actual extension length of the endmill from the holder is subject to set-up and measurement uncertainties. This results in uncertainty in the tool tip receptance which, in turn, leads to uncertainty in the stability limit. Propagation of uncertainties in the tool and holder models, spindle receptances, and cutting force coefficients to uncertainty in the stability limit may be computed using Monte Carlo simulation (Karandikar et al. 2010). This provides a predictive model, where a probabilistic, rather than deterministic, stability limit is presented. However, if a test is performed to determine the actual stability behavior of a spindle speed-axial depth combination, there is no straightforward mapping between this result and input parameters.

PGML approach to stability modeling

The PGML approach developed herein is described conceptually in Fig. 4. The approach combines two models. First, a PGML model is created from the physics-based dynamic stability model described in the previous section. The physics-based stability model is used to generate an initial training dataset that is comprised of three parameters, the target stability state—unstable (or chatter) and stable (or no chatter)—and two physics-based input parameters spindle speed and axial depth-of-cut. A machine learning method then “fits” a baseline PGML model to this initial physics-based training dataset. The baseline PGML model should have good fit with the physics-based data to serve as a benchmark for determining predictive accuracy improvements during updating with measured data and domain knowledge. Then, in an iterative process, points in the initial physics-based training dataset are updated with measured (off-line or in-process) stability values; points whose true stability state is known by domain knowledge are also updated. The PGML model is retrained at each iteration. The points to be sampled for measurement at each iteration are selected according to novel updating strategies. As the updating process proceeds iteratively, the PGML SLD is hypothesized to converge toward the SLD of the true stability model. As convergence occurs, the predictive accuracy of the PGML model increases proportionately.

Fig. 4
figure 4

Physics-Guided Machine Learning Approach

Physics-based knowledge is incorporated by three mechanisms shown by shaded red boxes in Fig. 4 with the solid outline. First, the physics-based stability model on the left generates an initial physics-based training dataset composed of physics-based stability values for combinations of spindle speed and axial depth-of-cut. The baseline PGML stability model is trained using this baseline training dataset. Second, selection of points to be measured experimentally or on the production floor is guided by the physics-based SLD. Third, non-measured data points that are known to have true stability (stable or unstable) are updated by domain knowledge. Consistent with the framework in Fig. 4, numerical simulation experiments were performed to test whether a physics-based stability model, with associated operational, model and measurement uncertainty, can be updated with measured data and domain knowledge to better approximate the true underlying stability model and, thus, improve predictions of instability.

Since the true underlying stability model and its SLD are typically unknown in practice, both physics-based and true stability models are simulated for these experiments. For the physics-based model described in Sect. 4, errors are intentionally introduced into the model inputs so that the predicted stability limit used to train the PGML model includes uncertainty. However, because the input errors are known, it is possible to determine the true (zero uncertainty) stability limit as well. This error-free stability model is used to generate true experimental data, which replaces the uncertain stability limit defined in the uncertain initial training dataset during updating.

For the purposes of these numerical experiments, the physics-based SLD is simulated as follows. Errors are introduced to both the RCSA tool tip receptance prediction and the force model to incorporate uncertainty. These errors include the extension length of the tool inserted in the holder for the RCSA model and the cutting force coefficients for the force model. The SLD is defined with and without these errors, where the SLD provides the limiting axial depth-of-cut as a function of spindle speed for a given radial depth-of-cut, milling orientation, and number of endmill teeth.

The two SLDs generated for the following experiments are shown in Fig. 5. The SLD on the left has been generated from the physics-based stability model with introduced errors; the true SLD on the right has been generated from the physics-based stability model without errors. As can be seen, there is a wide divergence between the SLD with errors and the SLD without errors (i.e., the true SLD). For this exercise, the correct values for the milling setup are: (1) 12 mm diameter tool with 4 teeth and a 53 mm extension length from the holder; and (2) a force model with coefficients \({k}_{tc}\) = 692.8 N/mm2, \({k}_{nc}\) = 400.0 N/mm2, \({k}_{te}\) = 0, and \({k}_{ne}\) = 0. These parameters were used to define the right SLD in Fig. 5.

Because there is uncertainty in these inputs, errors were introduced to define the starting SLD displayed on the left in Fig. 5. Errors were introduced for the tool extension length and force model coefficients. These erroneous values were: (1) tool with 50 mm extension from the holder; and (2) force model coefficients \({k}_{tc}\) = 649.0 N/mm2, \({k}_{nc}\) = 262.2 N/mm2, \({k}_{te}\) = 0, and \({k}_{ne}\) = 0. The same spindle and holder receptances were used in both cases and the same stability algorithm was applied (with modifications to the tool geometry and force model). Up milling with a radial depth of 2 mm was selected.

Fig. 5
figure 5

Comparison of SLD with errors (left) and true SLD (right)

Research framework

Simulation experiments were conducted over a 2-dimensional grid of 2,020 points described by spindle speed and axial depth-of-cut space as illustrated in Fig. 6. Spindle speeds ranged from 1.0 × 104 to 2.0 × 104 rpm along the horizontal x-axis, in increments of 100.0 rpm; axial depth-of-cut values ranged from 0 to 20 mm along the vertical y-axis, in increments of 1.0 mm. The training dataset is shown as solid circles. For testing the PGML models, another dataset of points was created using the midpoints of alternate x-axis and y-axis intervals in the training grid, also illustrated in Fig. 6. This out-of-sample test set includes 500 points shown by the open circles in the figure. Each combination of axial depth-of-cut and spindle speed in both the training and test sets is associated with two physics-based stability values (physics-based or true) computed as described in Sect. 5.

For baseline training of the PGML model prior to updating, each spindle speed and axial depth-of-cut point in the training set of 2,020 points is associated with a physics-based stability value. During training, the PGML model is “fit” to this baseline training dataset of 2,020 points using three different machine learning methods and then tested against the test set of 500 points. To update the PGML model, the true stability states of measured points are sampled from a set of true stability values, computed as previously described, and their values updated in the PGML training dataset. In practice, as discussed earlier, the true states are unknown. For the purposed of this simulation, true stability values are defined by the physics-based model without errors, as described earlier. Thus, for each combination of spindle speed and axial depth-of-cut in both the training data set of 2,020 points and the test dataset of 500 points, there are two associated stability states, the simulated physics-based value and the simulated true stability value. Over the sequence of iterations, as more points are sampled, measured, and updated, the baseline PGML training dataset contains fewer physics-based stability values and an increasing number of true or validated stability values.

Fig. 6
figure 6

Grid Design for Simulation Experiments

Training the physics-guided machine learning model

To evaluate the performance of the proposed approach, PGML models for each updating strategy were implemented using three machine learning methods: (1) K-Nearest Neighbors (KNN), (2) Support Vector Machines (SVM), and (3) Artificial Neural Networks with Nesterov-Accelerated Adaptive Moment Estimation (ANN-NADAM). Each reflects a different learning approach that captures different characteristics of the stability data being modeled. Four updating strategies are evaluated: (1) Random Sample without Domain Knowledge (RAN), (2) Random Sample with Domain Knowledge (RAN-DK), (3) Climb-the-Hill with Domain Knowledge (CTH-DK), and (4) Follow-the-Curve with Local Search and Domain Knowledge (FTC-LS-DK). In total, 12 experimental designs were created.

K-Nearest Neighbors (KNN)

KNNs (Tran et al., 2021; Deshmukh & Bhosle, 2018) are a non-parametric supervised machine learning method used largely for classification but also regression. Also known as “lazy learner”, KNN makes decisions by referring to the K data points closest to the data point of interest. The distance between any two data points is calculated by using any of several metrics. Common choices are Manhattan, Euclidean, and Minkowski distances. KNN (1) does not make strict assumptions over the distribution of the dataset, unlike linear regression; (2) performs especially well under the infinite sample set assumption, with less probability of error than any other decision rule; and (3) does not produce a generalized rule over the dataset, which means it requires less training time—an advantage for in-process sampling and measurement. However, it stores all training data during the validation phase which requires more memory. Hyperparameters for the KNN model are provided in Table 1.

Table 1 K-Nearest Neighbors Hyperparameters and Value Ranges

Support Vector Machine (SVM)

SVM (Peng et al., 2015; Wan et al., 2021) is a supervised machine learning method that is used largely for classification, but also prediction. Like ANNs, SVMs infer a function from labeled training data consisting of a set of training examples of paired inputs and outputs. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space where N is the number of input features that distinctly classifies the data points. For example, binary classification is performed by finding the hyperplane that best differentiates between two classes, i.e. maximizes the margin between the hyperplane and the support vectors, or closest values to the classification margins. The use of kernels can transform linearly inseparable problems into linearly separable ones. Given the highly nonlinear SLD, SVM’s ability to transform linearly inseparable problems may be an advantage. Hyperparameters for the SVM model are provided in Table 2.

Table 2 Support Vector Machine Hyperparameters and Value Ranges

Artificial neural network with Nesterov-Accelerated Adaptive Moment Estimation (NADAM)

NADAM is a type of gradient descent optimizer used in neural network models that minimizes the cost function by finding the optimized values for the weights during updating. NADAM is typically used in the case of noisy gradients or gradients with high curvatures. One limitation of gradient descent is that a single learning rate (i.e. step size) is used for all input variables. Extensions to gradient descent like the Adaptive Moment Estimation (ADAM) algorithm add a first and second moment of the gradient and automatically adapt the learning rate for each parameter that is being optimized. This approach may result in a step size that rapidly decreases to very small values. NADAM is an extension of ADAM that incorporates Nesterov momentum (NAG) and can result in better performance of the optimization algorithm. NAG is an extension to classical momentum where the update is performed using the gradient of the projected update to the parameter rather than the actual current variable value. This has the effect of slowing down the search when the optimum is located rather than overshooting. Thus, ANN-NADAM may have an advantage when fitting SLDs with large numbers of lobes and associated optima. Recent applications of ANN to machining include chatter prediction in milling (Mishra & Singh, 2022) and a forward prediction model for electrochemical micro-machining with knowledge embedded into an ANN (Lu et al., 2017). Hyperparameters for the ANN model are provided in Table 3.

Table 3 Artificial Neural Network with NADAM Hyperparameters and Value Ranges

Updating the physics-guided machine learning stability model

The PGML model is updated with simulated true stability states that would be measured in an operational environment, as described earlier. At each iteration, the replacement of the baseline physics-based (and uncertain) stability values with true stability values will force convergence of the physics-based SLD towards the true SLD for that particular milling machine, operational environment, and status of tool wear. By updating the PGML model with measured points, augmented by domain knowledge points, the process by which measurement data is captured on the factory floor can be simulated. An important question, then, is how to sample points for measurement to promote fast and efficient convergence of the PGML SLD to the true underlying SLD—and thus achieve the highest predictive accuracy with the fewest measured points.

Four updating strategies are implemented, as described and illustrated below, to evaluate the speed and efficiency of convergence of the PGML SLD to the true SLD at each iteration. The dashed red line represents the physics-based simulated PGML SLD, while the solid black line is the true simulated SLD. For the purpose of determining efficiency of convergence, we distinguish between points that are measured—points that are, in practice, experimentally determined during production and for which there is an acquisition cost in terms of production interruption or downtime—and points that do not have any associated cost because they are determined by domain knowledge rather than experiment. Points can be measured or updated only once to avoid double counting although there is no cost associated with updating a domain knowledge point. All measured points are updated unless they have been updated by domain knowledge in a previous iteration.

Random sample without domain knowledge (RAN)

Random Sample without Domain Knowledge assumes no a priori theory-based or expert domain knowledge to guide selection of measured points and, thus, provides a benchmark against which the information value of prior knowledge can be evaluated. Points are sampled randomly across the experimental grid without replacement. No points are updated by domain knowledge. To start the updating process, 1,000 points are randomly sampled from the grid. At each of 10 update iterations, 100 points are randomly selected from the initial sample set of 1,000 points. The true stability states of these points are measured and their respective values are updated in the PGML training dataset. When all 100 sample points in each iteration have been measured and updated, the PGML model is retrained. Figure 7 illustrates the RAN updating strategy for a small set of 13 randomly sampled points. In the left-hand figure, solid red points that lie above the PGML SLD, shown as a red dashed line, are unstable prior to updating. Conversely, solid green points below the PGML SLD are stable prior to updating. After measuring the true states of all 13 points, four points have been updated, as shown in the grid on the right. Two unstable points have been updated to stable, indicated by the arrows and light green outlined points; similarly, two stable points have been updated to unstable, indicated by arrows and light red outlined points. No points have been updated by domain knowledge. After 10 iterations, all 1,000 sampled points have been measured and their stability states updated in the PGML training dataset.

Fig. 7
figure 7

Random Sample. Sampled points at first iteration (left) and measured and updated points (right)

Random sample with domain knowledge (RAN-DK)

Random Sample with Domain Knowledge leverages domain knowledge, or physics-based knowledge, to update additional grid points to their true value, but without measurement. The addition of domain knowledge points increases the information content of the training set and speeds convergence to the true SLD. It is known from theory (and practice) that all values of axial depth-of-cut above those that have been measured to be unstable must also be unstable—assuming constant spindle speed. Similarly, all points below the axial depth-of-cut of a measured stable point must also be stable—assuming constant spindle speed. The addition of domain knowledge points also reduces the number of points that need to be measured to achieve target convergence. In Fig. 8, the domain knowledge points that have been updated to stable below a measured stable point—or unstable above a measured unstable point—are shown in light gray without an outline.

Fig. 8
figure 8

Random Sample with Domain Knowledge. Measured stable and unstable points (left) and updated domain knowledge points (right)

Climb-the-Hill with domain knowledge (CTH-DK)

Climb-the-Hill with Domain Knowledge leverages a strategy that discovers the unknown (in practice) underlying SLD by systematically measuring points at the lowest depth-of-cut and incrementing upward until chatter is observed. The process is analogous to how many manufacturers develop their guidelines for parameter selection. The process is initiated by measuring and updating all points with axial depth-of-cut equal to 1.0 mm and spindle speeds ranging from 1.0 × 104 to 2.0 × 104 rpm. At each subsequent iteration, as samples “climb-the-hill,” the axial depth-of-cut is incremented one unit. In the following experiments, each incremental unit equals 2 mm. All points at the incremented depth-of-cut that lie above a point measured to be stable in the previous iteration are measured and updated. Further, since all points above a measured unstable point are unstable, all points above a measured unstable point are also updated as unstable in the PGML training dataset through domain knowledge. As illustrated in Fig. 9, on the left, all points at 1.0 mm axial depth-of-cut are measured to be stable during the first iteration and are shown in solid dark green. Therefore, at the second iteration on the right, all points are measured at axial depth-of-cut equal to 3.0 mm. During the second iteration, five points are measured to be unstable and are shown outlined in light red. Their values are updated to unstable in the PGML training dataset, along with all points above it by domain knowledge. Domain knowledge points are shown in light gray. Similarly, at the third iteration at axial depth-of-cut of 5.0 mm, only points above a stable point in the previous iteration are measured. Seven additional points are measured as unstable and their values are updated to unstable in the PGML training dataset, along with all domain knowledge points above them. The process continues similarly at each iteration.

Fig. 9
figure 9

Climb-the-Hill. Measured stable points at first iteration (left) and updated measured and domain knowledge points at second and third iterations (right)

Follow-the-Curve with local search and domain knowledge (FTC-LS-DK)

Follow-the-Curve with Local Search and Domain Knowledge guides the process of selecting points for measurement along the physics-based SLD where the information value of the measured points is hypothesized to be highest due to its likely proximity to the true underlying SLD. The selection strategy is completed by domain knowledge and a local search process which also focuses measurement in the area of highest information value. Points are measured at equal intervals of spindle speed along the baseline PGML SLD and in increasingly smaller increments of spindle speed at each iteration. Follow-the-Curve also includes a local search routine that measures points around the sample points in a prescribed way. The strength of the Follow-the-Curve strategy is that the local search allows it to gain more information, more quickly, about the location of the true SLD. Domain knowledge is implemented for both measured sample and for local search measured points around the sample points on the SLD.

As shown in Fig. 10, Follow-the-Curve begins by selecting points at regular intervals of spindle speed along the x-axis. The sample point to be measured is the point on the bsaeline PGML SLD associated with the selected value of spindle speed on the x-axis. For example, at the first iteration, 11 points are selected at equal increments of 1.0 × 103 rpm along the spindle speed axis from 1.0 × 104 to 2.0 × 104 rpm. The corresponding points on the PGML SLD are initially assumed to be stable. After measurement three points, shown in red, are unstable while eight points, shown in green, are confirmed stable. Measured values are updated in the PGML training dataset. At each subsequent iteration, midpoints for each of the spindle speed intervals on the x-axis, shown as outlined white circles on the right, are measured and updated, as necessary. The process continues until a final interval size of 100 rpm at Iteration 6, at which point every spindle speed value along the x-axis in the training dataset has been measured.

Fig. 10
figure 10

Follow-the-Curve. Sampled points at first iteration (left) and at second iteration (right)

After each sample point on the baseline PGML SLD is measured, a local search is implemented around that point. Adjacent points in each of the four compass directions (north, south, east and west) are sequentially measured and their values updated. The local search in each direction stops when the measured state matches the value in the PGML training dataset. Local search is illustrated in Fig. 11. Following the numbering, at Step 1 the sample point on the curve at spindle speed 1.4 × 104 rpm, indicated as a solid light green circle, is measured as stable and its stability value in the PGML training dataset is updated. At Step 2, proceeding in the south direction, all points below the measured stable point are updated as stable by domain knowledge without measurement. In all cases, if a sample point is measured to be stable, no further measurement is performed in the south direction and all points below are updated as stable without measurement. Similarly, no further measurement is performed in the north direction if a sample point is measured to be unstable and all points above that point are updated as unstable without measurement. Domain knowledge points are updated above or below the sampled point, but never in both directions.

Fig. 11
figure 11

Follow-the-Curve with Local Search and Domain Knowledge

Continuing with Step 3, to the west we measure the next adjacent point as stable which does not match the value in the PGML training dataset, and its value is updated in the PGML training dataset. All points below that point are updated as stable without measurement by domain knowledge and the local search to the west continues. An additional point is measured. Its stability state does not match the value in the PGML training dataset and is also updated. In Step 4, the next point measured is unstable and does not match the value in the PGML training dataset, indicated by a solid red circle. All points above that point are updated as unstable without measurement by domain knowledge and the local search to the west continues. Finally, at Step 5, the measured local search point to the west is measured as stable which matches the value in the PGML training dataset and the local search in that direction stops. Domain knowledge points below that point are updated as stable. Search in the north (Step 6) and east (Step 7) directions proceed similarly.

Experimental design for training and updating the PGML model

The workflow for implementing PGML model training and updating in each of the experiments is provided in Fig. 12. Each of three machine learning methods was paired with each of the four updating methods and metrics for evaluating their respective predictive accuracies and stability convergence scores according to the performance metrics described below. As illustrated in the flowchart in Fig. 12, three physics-based models are used to analytically determine the physics-based dynamic stability model and its associated SLD. This model informs the creation of two simulated datasets. The first is the physics-based dataset for baseline training of the PGML and the second is the true dataset from which the measurement values are sampled (simulating experimental data collection on the factory shop floor) for updating the PGML model. Values of these datasets populate a 2-dimensional experimental grid over the experimental domain of interest where the x-axis is spindle speed (rpm) and the y-axis is depth-of-cut (mm). A baseline PGML model is fit to the initial physics-based training dataset and predictions of the stability states are produced for the test set. Model accuracy and stability convergence are computed to serve as a benchmark for evaluating the convergence of the updated PGML models to the true underlying SLD. In an iterative fashion, measurements of true stability states are sampled from the simulated true dataset as specified by the updating strategy and the PGML dataset is updated. At each iteration, the stability states are predicted for the test set and performance metrics are computed to track convergence of the physics-based SLD to the true underlying SLD.

Fig. 12
figure 12

Process for Training, Testing and Updating PGML Model at Each Iteration

Performance metrics ATRAIN and ATEST express the percent of the points for which the predicted PGML stability value matches the stability value in the training dataset, and the percent of points for which the predicted PGML stability value matches the stability value in the test dataset, respectively. Training and testing accuracy are computed in the conventional way. Training accuracy, ATRAIN, is defined as the number of points in the training set for which a correct prediction (CP) is obtained by the PGML model when compared with the training dataset values, divided by the total number of predictions (P). Testing accuracy, ATEST, is defined as the number of points in the test set for which a correct prediction (CP) is obtained by the PGML model when compared with the test dataset values, divided by the total number of predictions (P). The general expression of accuracy, for both training and testing, with respect to their training and test datasets, is provided in Eq. 5 below:

$${\rm A}_{\rm TRAIN} (\%)= {\rm A}_{\rm TEST} (\%)= \frac{CP}{P}*100$$
(5)

Stability convergence, CSLD, expresses the percent of points for which the PGML-predicted stability value matches the true stability state at each iteration. Changes in the magnitude of CSLD from iteration to iteration reflect the speed with which the PGML SLD gets closer and closer to, or converges to, the true SLD. A high stability convergence value means that the PGML model agrees more closely with the true stability model and that the number of correct stability predictions by the PGML model is high. Similarly, increasing convergence scores at each iteration indicate that the PGML SLD curve is converging towards the true SLD curve and that the predictive capability of the PGML model is increasing. Stability convergence CSLD is formally defined as the number of points in the test set for which a correct prediction (CP) is obtained by the PGML model when compared with the true value, divided by the total number of predictions (P), as follows in Eq 6:

$${\rm C}_{\rm SLD} (\%)= \frac{CP}{P}*100$$
(6)

F1Score. The \(F\)1 score, also called \(F\)-measure, is the harmonic mean between recall and precision where precision is the number of true positive (TP) results determined by the model divided by the number of all positive (TP + FP) results, and recall is the number of true positive (TP) results divided by the number of all samples that should have been identified as positive (TP + FN). The F1 score takes both precision and recall into account at equal weights and generates a high score only if the number of true positives obtained is high compared to the other prediction outcome categories. The \(F\)1 score, computed as shown in Eq. 7, ranges from 0.0 to + 1.0 (expressed as a percentage) where a higher score indicates a better classifier performance.

$${F_1} = 2 * \frac{{{\text{precision}} * {\text{recall}}}}{{{\text{precision}} + {\text{recall}}}} = \frac{{2 * TP}}{{2 * TP + FP + FN}}$$
(7)

Area Under the Curve (AUC). The AUC measures the ability of a classifier to distinguish between classes. The AUC equals the probability that the classifier will rank a randomly chosen positive example higher than that of a randomly chosen negative example. The AUC, ranging between 0.0 and + 1.0 (expressed as a percentage), is typically used for binary classification problems. The higher the AUC, the better the model is at distinguishing between classes.

In addition to computing performance metrics ATRAIN, ATEST, CSLD, F1 and AUC, the number of measured points and domain knowledge points that are validated as stable or unstable at each iteration are also counted. Tracking point counts, especially counts of measured points that incur cost, provides insight into the information value of measurement and domain knowledge with respect to predictive accuracy. Depending on updating strategy, different numbers and locations of points will be measured and updated at each iteration. Similarly, different numbers and locations of domain knowledge points will be confirmed as stable or unstable at each iteration. These different spatial patterns of sampling and updating for each strategy drive improvements in convergence to the true SLD and make a particular updating strategy better than another from the perspective of predictive accuracy and measurement cost.

Numerical results

Numerical results addressing the following research questions are provided below: (1) Does the baseline PGML model, without updating, adequately fit the physics-based data, and which machine learning method provides the best fit as a baseline for subsequent updating; (2) Can the predictive ability of the baseline PGML model be improved by updating the PGML training dataset with measured data and domain knowledge, and which updating strategy provides the best convergence (i.e. predictive accuracy); and (3) Can predictive accuracy be further improved using physics-based knowledge of the SLD by implementing local search around measured points. Results for each of the 12 experiments are presented. Results from each experiment are averages of 100 runs. Reference to the physics-based and true SLDs in the following discussion of experimental results refers to the simulated SLDs, respectively.

Accuracy and convergence during baseline PGML model fitting

A first question is whether the baseline PGML model, without updating, adequately fits the physics-based data and which machine learning method provides the best fit. To compare methods, the baseline PGML model was trained using KNN, SVM and ANN-NADAM and metrics ATRAIN and ATEST were computed. Typically, ATRAIN results are higher than ATEST results because the points in the test dataset have not been evaluated or seen by the PGML model during training. Note that model fit here refers to the fit with the baseline physics-based dataset and does not reflect how well the PGML model predicts the true stability states. To evaluate how well the PGML model predicts true stability states, stability convergence CSLD is computed. Both accuracy and convergence results can help assess which of the machine learning methods will perform best when retraining the PGML model during updating.

Table 4 compares the training and testing accuracies for the three machine learning methods KNN, SVM, and ANN-NADAM, as well as two other useful metrics for classification problems with class imbalance, F1 and AUC scores. KNN provides the best training accuracy, correctly predicting all stability values in the training dataset. All methods, however, exhibit ATRAIN scores of 97.2% or higher. Performance against the test dataset is comparably good, with ATEST scores between 95.6% and 97.0% for all methods. Thus, all machine learning methods provide overall good fit with the physics-based model and presage good predictive accuracy during updating. Stability convergence CSLD results are also provided in Table 4 and range between 81.0% and 81.5%. ANN-NADAM achieved the highest CSLD score at 81.5%. All three methods provide a comparable baseline for evaluating predictive accuracy improvements due to measurement and domain knowledge updating and were retained for further experiments.

Table 4 Performance Results for Baseline PGML Model Fitting

However, to account for small differences in performance across the methods during updating, each updating strategy is paired with each machine learning method in12 experiments, as noted earlier. Interestingly, ANN-NADAM performed least well among the three methods with respect to both ATRAIN and ATEST, but slightly better with respect to CSLD. Average F1 scores and AUC scores are also presented for each method in Table 4. Beyond accuracy, the F1 score takes into account not only the number of prediction errors made by a model, but also captures the type of errors that are made. A model will obtain a high F1 score only if it can predict most of the positive, chatter cases (i.e. high precision) and if it can correctly identify most of the positive, chatter cases (i.e. high recall) in the data. As shown in Table 4, the methods have high F1 score varying from 93.8 to 95.9% for the test dataset. The higher the AUC score the better the ability of the method to distinguish between the positive and negative classes, i.e. chatter and no-chatter. The three methods’ AUC scores exhibit very good performance varying from 99.3 to 100% on the test dataset.

Convergence during PGML model updating

A second question is whether the predictive ability of the baseline PGML model can be improved by updating the PGML training dataset with operational data and domain knowledge, and which updating strategy provides the best convergence (i.e. predictive accuracy). Having confirmed that each machine learning method offers a good fit to the physics-based training dataset, and comparable stability convergence, all 12 experiments were performed for the four updating strategies described in Sect. 6.0 and all performance metrics computed for both training and testing. The stability convergence scores and point counts for each iteration are provided in Tables 5, 6, 7 and 8 for all experiments. Accuracy, F1 and AUC scores, computed for each iteration over all experiments, were as expected and are omitted due to space limitations.

Stability convergence, CSLD, is expected to increase as the PGML training dataset is updated with measured values and domain knowledge at each iteration consistent with the hypothesis that the PGML model is converging to the true stability model. Stability convergence, as noted earlier, reflects the increasing accuracy of the PGML model in predicting the underlying true stability values. This hypothesis is confirmed by the results in Tables 5, 6, 7 and 8. Conversely, and as expected, as stability values in the PGML training dataset are updated with measured values, stability convergence CSLD improves for all 12 experiments. Again, looking at KNN for the random updating strategy (RAN) in Table 5, CSLD increases from 81.0% at the baseline iteration to 88.3%, or an increase of 9.0% and ANN increases from 81.5 to 89.0% or an increase of 9.2%. Similar increases in CSLD from baseline iteration to final iteration are obtained for the other three updating strategies as shown in Tables 6, 7 and 8. Stability convergence is defined as the percent of correct stability predictions achieved during testing. All four strategies achieved a predictive accuracy of at least 88%. The Follow-the-Curve strategy achieved the highest predictive accuracy of 95% with KNN.

Table 5 Stability Convergence and Point Counts for Random (RAN)
Table 6 Stability Convergence and Point Counts for Random with Domain Knowledge (RAN-DK)
Table 7 Stability Convergence and Point Counts for Climb-the-Hill (CTH-DK)
Table 8 Stability Convergence and Point Counts for Follow-the-Curve (FTC-LS-DK)

The comparative gains in predictive accuracy by PGML models over physics-based models (equivalent to baseline PGML model) are summarized in Fig. 13. The results are as expected but contain some surprises. RAN, the benchmark against which other updating strategies can be compared, had the lowest CSLD improvement of 8.8% averaged over the three machine learning methods. RAN can be thought of as an uninformed method since no a priori knowledge is used to sample points for measurement and no domain knowledge is implemented. In contrast, RAN-DK, CTH-DK, and FTC-LS-DK all implement physics-based theory and/or domain knowledge in different ways. While neither RAN-DK nor CTH-DK assumes a priori knowledge of the physics-based SLD, they do utilize domain knowledge to increase predictive accuracy. FTC-LS-DK implements a priori knowledge of the physics-based PGML SLD to guide sampling directly on the physics-based SLD curve. FTC-LS-DK also has the benefit of local search. Local search is hypothesized to add information about the location of the true underlying SLD by additional measurement in close proximity to the true SLD. RAN-DK, which builds in domain knowledge, increased the percent improvement in CSLD over baseline to 15.0%. Surprisingly, RAN-DK, CTH-DK and FTC-LS-DK obtain equivalent results, as seen in Fig. 13. All three strategies improved CSLD over baseline by (on average) 15.0%, 15.9% and 16.0%, respectively, after a complete iteration cycle. Local search did not add substantially to improvements in CSLD when comparing FTC-LS-DK and CTH-DK strategies.

Fig. 13
figure 13

Percent Improvement in Stability Convergence Over Baseline After Updating

More important, while RAN-DK, CTH-DK and FTC-LS-DK produced comparable improvement in CSLD over their respective complete iteration cycles, the trajectories of improvement at each iteration are very different. Iterative improvements in CSLD for the four updating strategies are compared in Fig. 14. As can be observed in the figure, each strategy has a unique signature. Recognizing that the preferred updating strategy is the one that maximizes CSLD with minimal investment in measurement, the desired signature is a rapid increase in CSLD at early iterations when cumulative measurement costs are still low. As expected, RAN underperforms the other three strategies, achieving less than 90% convergence after 10 iterations. When equal numbers of points are selected at random without a priori theoretical knowledge, improvements from iteration to iteration can be expected to be approximately uniform, as observed by the linear signature. Any variability across ML methods for RAN can be explained by different spatial positioning of the measured points with respect to the PGML SLD at each of 100 repetitions of the experiment. RAN-DK approaches 95% stability convergence by the 10th iteration which can be explained by the influence of additional domain knowledge points. Declining CSLD improvement for RAN-DK at later iterations reflects fewer numbers of measured points since points in the sample to be measured have been already updated by domain knowledge at previous iterations.

In particular, very different behaviors can be observed for the CTH-DK and FTC-LS-DK strategies. While both achieve CSLD of 95%, the trajectories are quite different. With CTH-DK, stability convergence starts off slowly for the first two iterations and then increases rapidly before flattening after the fifth iteration. In contrast, FTC-LS-DK stability convergence increases sharply from baseline to the first iteration, then shows smaller increases at subsequent iterations. As reported in Tables 5, 6, 7 and 8, and illustrated graphically in Fig. 13, the percent increase in CSLD across the iteration cycles ranges between 8.1% and 9.2% for RAN, between 13.1% and 16.3% for RAN-DK, between 14.4% and 17.2% for CTH-DK, and between 14.0% and 17.3% for FTC-LS-DK.

Fig. 14
figure 14

Stability Convergence CSLD by Updating Strategy and Iteration

Convergence of the PGML SLD to the true underlying SLD at each iteration can be visualized graphically for FTC-LS-DK in Fig. 15. The true SLD is shown by a solid black line and the PGML SLD by a dashed red line. The darkly shaded areas between the true SLD and the PGML SLD, as well as the lightly shaded gap areas below the true SLD are regions of prediction error. The dark pink shaded areas indicate regions where unstable stability states are incorrectly predicted as stable by the PGML. The light pink shaded areas below the true SLD define regions in which the stability states are correctly predicted as stable; similarly, the white areas above both curves define regions where unstable states are correctly predicted as unstable. And lightly shaded areas between the two curves define regions where stable states are incorrectly predicted as unstable. At the baseline iteration a number of regions of prediction error, noted by dark shaded regions above the true SLD and lightly shaded regions below the true SLD, can be seen. As more measured points are updated in the training dataset, these regions decrease in size. By the last iteration, the updated PGML SLD has substantially converged to the true SLD with a few exceptions, e.g. in particular, the large area to the right side of the figure at high spindle speeds and a small lobe to the left size of the figure at low spindle speeds.

Fig. 15
figure 15

Convergence of PGML SLD to True SLD with FTC-LS-DK Updating Strategy

Local search stopping rules for Follow-the-Curve

To further improve the performance of the Follow-the-Curve strategy (FTC-LS-DK), three stopping rules that control the pattern local search around sampled points on the physics-based SLD were investigated. Local search gathers information of higher value in areas that are hypothesized to be close to the true underlying SLD. As described in Sect. 6, each local search point, is measured and updated in the four compass directions until the process is halted by the stopping rule in effect. For each measured local search point, domain knowledge points are also updated, without measurement. Stopping Rule 1 was implemented in the previous experiments.

Stopping Rule 1: Stop on Match. Local search in each compass direction stops when the measured value equals the value in the PGML training dataset and the local search measured point is updated. Domain knowledge points are updated above or below the measured local search points, as appropriate.

Stopping Rule 2: Stop on Mismatch. Local search in each compass direction stops when the measured value doesnotequal the value in PGML training dataset and the local search measured point is updated. Domain knowledge points are updated above or below the measured local search points, as appropriate.

Stopping Rule 3: Stop on Measured Sample Point Mismatch. The local search in each compass direction stops when the measured value does not equal the stability value of the measured sample point (the local search original starting point) and the local search measured point is updated. Domain knowledge points are updated above or below the measured local search points, as appropriate.

Figure 16 illustrates the different search behaviors of each stopping rule and allows visual comparison of the relative performance of each with respect to convergence. Graphs in the left column illustrate local search mechanics for Stopping Rule 1, Stopping Rule 2, and Stopping Rule 3, all at iteration 1. Graphs in the right column compare the convergence of the final PGML SLD to the true SLD after all six iterations. Stopping Rule 1 offers the best performance with fewer regions of prediction error after the last iteration. As previously, sample measured points are outlined in red; local search measured points in the four compass directions are shown in solid green; domain knowledge points are shown in light gray. Also, as previously, the baseline PGML model prior to Iteration 1 was trained using the physics-based training dataset. At subsequent iterations, the PGML model was retrained with updated training datasets. ANN-NADAM was used to train the PGML model at all iterations—the best performing of the three methods during baseline testing.

The mechanics of local search are easily seen in the top left in Fig. 16. Stopping Rule 1 measures points in each compass direction until the measured local search point matches the value in the PGML training dataset. Sample point A, indicated in the figure by the shaded red circle, is predicted by the baseline PGML model to be stable, but is measured in Iteration 1 as unstable. Since the point is measured to be unstable, all points to the north are updated as unstable, as shown. Moving to the east, the point to the right is measured as unstable, which matches the PGML model value and the local search in that direction ends. Next, a local search in the south direction is conducted measuring stable points until it hits Point B which measures stable, as predicted by the PGML model. Finally, local search turns to the west direction, measuring stable points until the measured value is equal to PGML predicted value of stable which stops the local search to the west.

The superior performance of Stopping Rule 1, and the reason for its selection as the local search stopping rule, can be seen by comparing the regions of prediction error in the graphs on the right. All three stopping rules result in regions of prediction error on the far right and left of the graph. However, Stopping Rule 2 fails to correctly predict stability in three additional regions, two of which are around the tall stability lobes. In two of the three regions the PGLM model predicts unstable states when the true states are stable while, in the third, the PGML model predicts stable states when the true states are unstable. Stopping Rule 3 reduces prediction errors around the tall lobes, but the third region of prediction error is only slightly reduced.

Fig. 16
figure 16

Comparison of Local Search Stopping Rule Behaviors for FTC-DK: Rule 1 (top row), Rule 2 (middle row), and Rule 3 (bottom row)

Numerical results in Table 9 comparing accuracy, ATRAIN and ATEST, and convergence CSLD for the three stopping rules are consistent with the above observations. With respect to CSLD, Stopping Rule 1 and Stopping Rule 3 achieve 92.1% and 91.5% convergence, respectively, both higher than Stopping Rule 2 at 89.4%. When differences in baseline fit for each stopping rule are considered, Stopping Rule 1 and Stopping Rule 3 report the highest improvements of CSLD over baseline of 12.9% and 12.1%, respectively. Stopping Rule 2 reports only 9.6% over baseline, respectively.

Table 9 Accuracy and Convergence for Local Search Stopping Rules (FTC-LS-DK with ANN-NADAM)

The superior performance of Stopping Rule 1, as well as Stopping Rule 3 can be explained by the larger numbers of grid points that are validated as true by measurement or domain knowledge. A reasonable expectation is that a stopping rule that maximizes the number of measured points that are updated, as well as domain knowledge points, would have an advantage. The number of points validated as to their true state by measurement and domain knowledge, shown in Table 10, also support the superior performance of Stopping Rule 1 and Stopping Rule 3. Table 10 compares the total points validated for each stopping rule over all six iterations, broken down into measured points (both on the curve and by local search) and unmeasured domain knowledge points. As shown in Table 10, Stopping Rule 1 achieves the highest convergence (92.1%) with the fewest measured points (332) and the most domain knowledge points (676), for a total of 1008 validated points. Stopping Rule 3 achieves comparable convergence (91.5%) with 432 measured points and 641 domain knowledge points, for a total of 1073 validated points. Stopping Rule 2 has the fewest validated points (962) and, as observed earlier, the domain knowledge points are not in locations that provide leverage to reduce the size of the regions of prediction error.

Table 10 Point Counts for Local Search Stopping Rules (FTC-LS-DK with ANN-NADAM)

Another factor contributing to the superior performance of Stopping Rule 1 is the percent of total points in the PGML dataset that have been updated to their true state at the end of a complete iteration cycle. A particular stopping rule may, by chance, validate more points than another stopping rule that change their state to the true stability value since they sample different sets of points. However, at the end of the iteration cycle, nearly all grid points are expected to have been updated to their true value through measurement or domain knowledge. To confirm this, the number of true points as a percent of total points in the grid, is shown in Table 9. At the end of the iteration cycle, the percentages of validated true points for Stopping Rules 1, 2 and 3 are 96.1%, 91.6%, and 95.6%, respectively.

Balancing predictive accuracy and measurement cost in the factory

Measuring the stability states for combinations of spindle speed and axial depth-of-cut can be prohibitively expensive for many machine tool companies and job shops. In a production environment, keeping manufacturing costs low is a priority. And while avoiding chatter both improves surface quality and reduces the expense of rework and waste, the costs and time needed to implement the PGML approach in practice can be sizeable and must also be considered. From a practical perspective, then, an important consideration for factory floor operations is selection of an updating strategy that balances predictive accuracy (i.e. stability convergence) and measurement cost. Assessing this trade-off is a major goal of this research.

Each of the four updating strategies investigated here offers a different approach for gathering measurement data during machining in order to avoid chatter. In particular, numerical experiments were designed to evaluate the ability of the PGML approach to locate the true underlying SLD with minimal process measurements. Control of the measurement process from iteration to iteration was defined by novel updating strategies that update points through measurement, physics-based theory or domain knowledge. Since the numbers and locations of updated points vary by updating strategy at each iteration, it is difficult to compare strategies directly by looking at stability convergence CSLD and measured point counts separately. While these metrics tell us a lot about the predictive accuracy of the PGML model overall, they do not provide practical insight for applications that depend on the information efficiency—or the information contribution of each measured point to gains in predictive accuracy. For decisions on the factory floor, then, the critical question is which and how many points should be measured to quickly locate the true position of the SLD. That is, what is the marginal value of measuring an additional point to the convergence of the PGML SLD to the true SLD. And additionally, how many points need to be measured to achieve a target stability convergence—that is, how many points need to be measured if one wants to be 95% accurate in their predictions of dynamic stability.

This trade-off between stability convergence CSLD and number of measured points is provided in Fig. 17 for each updating strategy using ANN-NADAM for model fitting and updating. Referring to Fig. 17, consider first a target of 95% stability convergence shown by the dashed black line. The number of measured points required to approach 95% stability convergence varies greatly by strategy. The RAN updating strategy never reaches 95% stability convergence after 1,000 measured points over 10 iterations. After 10 iterations, approximately 50% of the grid has been validated by measurement but only 89.0% convergence is achieved. RAN-DK, CTH-DK and FTC-LS-DK approach 95% target accuracy with significantly fewer measured points, plus contributions by domain knowledge points. The addition of (unmeasured and no cost) domain knowledge points to the updating strategy enables 92.2% convergence after 410 measured points over 10 iterations for RAN-DK, 93.2% convergence for CTH-DK after 10 iterations with 422 measured points, and 92.9% convergence for FTC-LS-DK after only 322 measured points in 6 iterations. The trade-off between stability convergence CSLD and number of measured points with target accuracy of 90% is also shown in Fig. 17 by a solid green line. Again, FTC-LS-DK is the most efficient updating strategy from a measurement cost perspective after 6 iterations, achieving 90% convergence with only 180 measured points. Results for a target accuracy of 85% are also shown in Fig. 17 by a dashed blue line.

Fig. 17
figure 17

Trade-Off Between Predictive Accuracy and Measurement Cost – ANN-NADAM

From a practical perspective, the updating strategy of choice would provide the most improvement in CSLD with the fewest points—and also the most quickly. The speed of convergence can be seen in the shapes of the curves in Fig. 17. The updating strategies display different geometric behaviors as CSLD increases. The RAN and RAN-DK strategies display an approximately linear relationship between CSLD and the number of measured points, as expected since each iteration starts with a selection of 100 randomly sampled points, with RAN-DK measuring fewer points at later iterations. The CTH-DK strategy displays a convex relationship between CSLD and the number of measured points, with small increases in CSLD at early iterations even though larger numbers of points are measured. The FTC-LS-DK strategy, in contrast, displays a concave relationship between CSLD and the number of measured points. Stability convergence, CSLD, for FTC-LS-DK increases more quickly during early iterations with fewer measured points overall but including local search points with high information value in close proximity to the true SLD—an advantage on the factory floor where low costs are a priority.

Finally, the contribution of domain knowledge, in addition to measured points discussed above, helps to explain the comparative performance of the updating strategies. Figure 18 summarizes the total numbers of measured and domain knowledge points over all iterations for the four updating strategies. By design, RAN does not leverage domain knowledge. However, each of the three other updating strategies benefits from the contribution of domain knowledge to the predictive accuracy of the PGML model, and the faster convergence of the PGML SLD to the true SLD. The superior performance of FTC-LS-DK, discussed earlier, can be attributed to the contributions of both measured points and domain knowledge points located spatially in the region of the true SLD due to the sampling strategy that is guided by the physics-based model.

Fig. 18
figure 18

Points Validated by Measurement and Domain Knowledge per Updating Strategy (All Iterations)

Conclusions

This research applies a new approach, physics-guided machine learning or PGML, to modeling dynamic stability for process control to avoid chatter during milling. The simulation experiments performed here address the problem of estimating the true—and unknown—stability boundary or SLD by updating the physics-based stability model with physics-based theory to guide the sampling strategy, combined with measurement data and domain knowledge. The PGML approach proposed offers the ability to improve productivity and part quality by selecting operational parameters that avoid chatter while minimizing production downtime required for data collection.

Simulation experiments were performed in which the PGML stability model was trained using an initial approximation of the physics-based model (with errors) and then updated with simulated measurement data (without errors) to approximate the true stability model specific to the operational environment. Four strategies to update the PGML model were explored—each strategy reflecting a different approach to incorporating both physics-based theory, real-time measurements and domain knowledge. All four updating strategies improved the predictive capability of the baseline physics-based model, with improvements ranging between 8.1% and 17.3%, depending on the ML method used, to achieve prediction accuracy approaching 95%. Updating strategies that leveraged both the information value of each measured and updated point and non-measured domain knowledge points, achieved the largest gains in predictive accuracy and converged most quickly to the underlying true stability boundary.

Further, all four updating strategies demonstrated convergence to the true underlying SLD, even when measurement points were selected randomly. The inclusion of domain knowledge in addition to measurement data for the Random with Domain Knowledge (RAN-DK), Climb-the-Hill with Domain Knowledge (CTH-DK) and Follow-the-Curve with Domain Knowledge and Local Search (FTC-LS-DK) strategies resulted in further improvements in both the magnitude and speed of convergence in all experiments—demonstrating the value of domain knowledge to data-driven stability modeling. Further, the choice of updating strategy mattered less than the fact that domain knowledge was included. Among the three updating strategies with domain knowledge, all demonstrated comparable convergence capability—approximately 15% improvement in convergence to the true SLD.

Follow-the-Curve (FTC-LS-DK) achieved the fastest convergence with the fewest measured points—suggesting its suitability for use in a production environment where measurement costs are a concern. Follow-the-Curve (FTC-LS-DK) leveraged physics-based knowledge about the approximate location of the true underlying SLD to obtain a faster rate of convergence at early iterations. A practical limitation of FTC-DK-LS is that a priori knowledge of the physics-based SLD is required to guide the selection of points to be sampled for measurement. Neither RAN-DK and CTH-DK nor require a priori knowledge of the physics-based SLD yet achieved comparable predictive accuracy, although with more measurement cost. Finally, the RAN-DK, CTH-DK and FTC-LS-DK strategies all approached 95% convergence to the true SLD over their respective complete iteration cycles. The proposed methods are currently being validated with experiments using real data obtained during the milling process in a laboratory setting. Work is continuing to refine the updating strategies reported here to further improve chatter prediction and reduce measurement cost. In particular, in-process knowledge of chatter occurrence while milling can allow adaptive updating strategies for choosing the next points for measurement. In addition, other ML methods such as hybrid models (Deshmukh & Bhosle, 2018) are being explored for more efficient baseline training.

The PGML approach shows promise in bridging the gap between theory and practice in chatter avoidance. Despite technical advances in process control for machining processes, managing dynamic instabilities during machining remains a challenge and an impediment to increased productivity and more consistent surface quality for machined parts. The true stability state and its SLD are unknown in practice, and machine operators typically resort to experience and/or manufacturer’s recommendations in setting the feeds and speeds to avoid chatter. Further, when chatter is experienced, the operator typically stops production to adjust operating parameters adding time and cost—and may need to remove the workpiece for later rework depending on the degree of surface roughness. The approach introduced here offers foundational support for future self-aware machining where a machine’s ability to discover its own SLD allows self-adjustment independent of the operator. This capability depends on self-knowledge of both the machine’s true and current stability states, plus a mapping of process, operational, and machining parameters to stable and unstable stability states during operation that allow it to maintain a stable state or return to a stable state in the event of chatter. Challenges that remain, and that are the next steps for this research, are continued development of the PGML approach and updating strategies for milling within a time-varying and continuous learning environment with the goal of enabling the capture, characterization, and prediction of the machining process in real time for machine-initiated chatter avoidance and corrections.