1 Introduction

In the field of materials engineering, constitutive equations are indispensable for understanding and predicting material behavior. In mechanics, constitutive equations link the strain in a material to the stress it experiences by providing a mathematical description of how a material deforms under load and the subsequent stresses that develop in response. The complexity of these equations can vary significantly, from simple linear elastic models to highly sophisticated elastoplastic or elastoviscoplastic models that account for plastic deformation and hardening effects as well as rate-dependent effects for the case of viscoplasticity. Among the tools used to describe these phenomena, yield functions are of paramount importance. They serve as a basis for constitutive models of plastic material behavior. Von Mises [1] first introduced a yield criterion, based on the hypothesis that the onset of plastic deformation occurs when the \({J}_{2}\) invariant of the Cauchy stress tensor reaches a critical value. However, the von Mises criterion is only valid for materials that exhibit isotropic plasticity and, as it is based on the deviatoric stress, the yield behavior is insensitive to hydrostatic pressure. For pressure-dependent materials, alternative yield criteria such as the Drucker-Prager [2] models are more appropriate. Despite the proven efficacy of classical constitutive equations in predicting material behavior, these models are often constrained by their inherent assumptions. Additionally, expert intervention is frequently required to modify these models to address adaptability to a wider range of scenarios. As a result, Machine Learning (ML)-based techniques that enable the generation of a surrogate model straight from data for describing complex material behavior present opportunities for improved efficiency and adaptability and thus have been investigated extensively [3,4,5,6,7,8]. 
The prediction capability of ML models is tightly bound to the nature of the training data; high-quality data sets can lead to models that precisely capture the intricacies of plastic deformation, while low-quality or insufficient data can result in poor performance. Consequently, the strategy for sampling the feature space and the generation of the training data sets is crucial for building accurate and robust ML models.

The following sections provide a detailed overview of various research studies that underscore the pivotal role of high-quality data and efficient data sampling strategies in enhancing the accuracy and robustness of ML models. Zhang and Mohr [9] highlighted the use of a neural network (NN) model to accurately represent the stress–strain response of a Levy–Mises solid with isotropic hardening. The study points to the necessity of reducing the size of the required training data as the feedforward models utilized in their study exhibited limited generalization capacity unless supplied with copious data, posing a challenge for their direct application in engineering materials. This further underscores the importance of sourcing high-quality data for training, possibly through virtual experiments via Representative Volume Element (RVE) analysis. Weber et al. [10] offered a novel perspective to address the challenges associated with data dependency in ML models. They introduced an NN model that integrates physical constraints, focusing on capturing elastoplastic material behavior. This method holds the potential to significantly enhance the model's generalization capabilities, even when the availability of training data is limited. Grytten et al. [11] developed an elastic–plastic constitutive model based on an adaptive distribution approach in the stress directions, recognizing that in the stress space, points tend to cluster in regions where the yield surface's gradient undergoes rapid changes. A space-filling algorithm in combination with the generalized full constrained (FC)-Taylor theory was then used to determine 690 deviatoric stress states at initial yielding. These stress states were later applied to calibrate the yield surface. 
Research conducted by Ibragimova [12] demonstrated the capability of neural networks trained with data from crystal plasticity (CP) simulations under unique monotonic loading conditions to accurately predict stress–strain curves and texture evolution of face-centered cubic metals. Their work employed Sobol sequences [13, 14] for data sampling, which offer an efficient way to fill the domain uniformly, yet quasi-randomly. However, this approach resulted in a very large dataset, comprising 1,451,161 unique loading condition samples, to effectively train the ML model. Yang et al. [15] developed a data-oriented approach using NN models for constructing elastoplastic constitutive laws for isotropic materials. They utilized homogenized stress–strain data, extracted directly from numerical simulations conducted on an RVE. The average stress corresponding to a given average strain was numerically computed over 500 loading steps, using 22 distinct loading directions in the principal stress space. The approach proposed by Sun and Vlassis [16] utilizes a data augmentation technique that multiplies each original data point into several new ones via the creation of a signed distance function level set. The transformed dataset is used to train an NN model with a specific loss function. This method reduces the stress representation to principal stresses only, thereby decreasing the dimensionality of the input space. Furthermore, their methodology uses a polar coordinate system for yield surface interpolation and applies 140 different Lode angles to partition the \(\pi \)-plane. In the work of Shoghi and Hartmaier [17], an optimal sampling strategy was introduced for uniform sampling of the yield locus in six-dimensional (6D) stress space, using a Monte Carlo and Fibonacci sequence-based strategy. This method facilitated the training of a Support Vector Classification (SVC) model as a yield function with high-quality data. 
It was shown that 300 loading directions were sufficient to provide a good data-based representation of the yield locus, even for severe anisotropic cases.

The methodologies described, while effective, can be data-intensive and do not always ensure the optimal selection of informative data for model training. Moreover, these methods often result in the inefficient utilization of computational resources and time, especially for complex materials and higher dimensional feature spaces. In the context of these limitations, active learning methods offer considerable potential for improving the data sampling process by iteratively guiding the selection of new data in the regions of feature space where the trained model has the largest uncertainties. By strategically choosing the most informative sampling points, active learning can contribute to building more accurate ML models for yield function prediction with fewer data points. This not only reduces the computational overhead but also allows for an enhanced generalization capability of the models. Among the different active learning scenarios in the literature, the idea to query for new data points, instead of labeling from a pool or stream of data, was first introduced by Angluin [18] and relies on the learner's ability to request a label at any location in the input space. One of the most popular query strategies used in the materials science domain is called uncertainty sampling and is based on probabilistic models, such as Gaussian processes [19,20,21]. New data points are selected at locations where the Gaussian process model predictions show the highest standard deviation and, thus, are most uncertain. However, in this work, the aim is to train SVC models using active learning, which requires a method that incorporates SVCs directly in the active learning loop instead of using Gaussian processes. Therefore, we apply the Query-By-Committee (QBC) algorithm, which was first introduced by Seung et al. [22] as a query selection framework that allows a committee of trained models to vote on the label of a candidate. The most informative query is the instance on which the committee members disagree most. 
Morand et al. [23] pointed out the advantage of enabling the training of arbitrary learning models by QBC and first showcased its usage in the materials science domain on the basis of NN models. Using the same method, Wessel et al. [24] employed QBC for efficient sampling of virtual experiments considering the full stress state, which is then used for identifying parameters of anisotropic yield models. A Gaussian process-based approach, however, has also been introduced by Wessel et al. for the reduced stress space in an earlier work [25].

The scientific contribution of the present paper is as follows: we address a fundamental hurdle faced by all data-oriented ML methodologies, namely the need for high-quality, representative, and optimal training data sets. These optimal data sets are paramount in training accurate ML-based constitutive models, especially where data acquisition is computationally expensive. The novel aim of this study is to overcome this challenge by introducing an active learning-based approach, specifically the QBC algorithm, for training an ML yield function using an SVC model. Embracing a 6D stress space, our method extends beyond the conventional 3D principal stress space to provide a more general description of the yield function. The active learning strategy is intended to overcome the limitations of static learning methods by enabling the selection of more informative data for model training. We are particularly interested in investigating whether this improves the training process and overall performance of the ML model and whether a more strategic active learning approach could decrease the reliance on large quantities of training data. Furthermore, this study intends to enhance the current understanding of the sampling process within the stress space and to investigate whether active learning tends to target specific regions of the 6D stress space. Moreover, as part of this study, we introduce a dynamic stopping criterion for the active learning process, which leads to a more efficient use of resources and finer control over the learning process.

2 Methods

In this section, a comprehensive explanation of the ML yield function and the principle of active learning is provided, with the goal of integrating these two fundamental elements to optimize training of SVC-based yield functions. The ML yield function is detailed first, and then the principles of active learning are delved into, focusing on how it optimizes training by selecting the most informative data points. Among these principles, the QBC technique is thoroughly explained. Finally, we discuss the integration of active learning with ML yield functions, aiming to boost training efficiency and enhance prediction accuracy in SVC. While these methods might seem distinct at first glance, they are intricately interconnected, forming a cohesive and innovative approach to optimizing the learning process of SVC-based yield functions.

2.1 Machine learning yield function

The elastic–plastic deformation of materials highlights the interdependent relationship between the applied force on a material and the ensuing deformation. The yield function is a theoretical framework that demarcates the transition from elastic to plastic deformation, i.e., the point at which the equivalent stress reaches the material's yield strength. It can be written as

$$f\left({\varvec{\sigma}}\right)={\sigma }_{eq}\left({\varvec{\sigma}}\right)-{\sigma }_{y}$$
(1)

Here, \({\sigma }_{y}\) is the yield strength and \({\sigma }_{eq}\left({\varvec{\sigma}}\right)\) is the equivalent stress, which in this work follows the definition of Hill for anisotropic materials as

$$\begin{aligned}{\sigma }_{eq}\left({\varvec{\sigma}}\right)&=\frac{1}{\sqrt{2}}\big[{H}_{1}{\left({\sigma }_{1}-{\sigma }_{2}\right)}^{2}+{H}_{2}{\left({\sigma }_{2}-{\sigma }_{3}\right)}^{2}\\&\quad+H_{3}{\left({\sigma }_{3}-{\sigma }_{1}\right)}^{2}+6{H}_{4}{\sigma }_{4}^{2}+6{H}_{5}{\sigma }_{5}^{2}+6{H}_{6}{\sigma }_{6}^{2}\big]^{1/2}\end{aligned}$$
(2)

The components \({\sigma }_{1}, {\sigma }_{2}\) and \({\sigma }_{3}\) represent the normal stresses in three mutually orthogonal directions and \({\sigma }_{4}, {\sigma }_{5}\) and \({\sigma }_{6}\) denote the three independent shear stresses. Each coefficient \({H}_{i}\) modulates the influence of its corresponding stress component, thereby encapsulating the material's anisotropic response in that specific stress direction or plane. Notably, in cases where all the \({H}_{i}\) coefficients equate to one, the equation aligns with isotropic behavior, and the equivalent stress definition resonates with the isotropic von Mises (\({J}_{2}\)) criterion. Given that a symmetric stress tensor can be completely described by six independent stress components, our study harnesses this 6D stress space to develop and train an ML-based yield function.
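As a small illustration of Eqs. 1 and 2, the following sketch evaluates the Hill-type equivalent stress and the yield function for a 6D stress vector. The function names and the numerical stress values are assumptions made for this example only; they are not taken from the reference implementation. With all \({H}_{i}=1\), a uniaxial stress should return its own magnitude and a pure shear stress \(\tau\) should return \(\sqrt{3}\tau\), as expected for the von Mises criterion.

```python
import math

def hill_equivalent_stress(sig, H=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Equivalent stress of Eq. 2 for sig = [s1, s2, s3, s4, s5, s6].

    s1..s3 are the normal stresses, s4..s6 the shear stresses.
    H holds the six Hill coefficients; all ones gives the isotropic case.
    """
    s1, s2, s3, s4, s5, s6 = sig
    h1, h2, h3, h4, h5, h6 = H
    inner = (h1 * (s1 - s2) ** 2 + h2 * (s2 - s3) ** 2 + h3 * (s3 - s1) ** 2
             + 6.0 * (h4 * s4 ** 2 + h5 * s5 ** 2 + h6 * s6 ** 2))
    # (1 / sqrt(2)) * sqrt(inner) written as a single square root
    return math.sqrt(0.5 * inner)

def yield_function(sig, sigma_y, H=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Eq. 1: negative inside the elastic region, zero on the yield locus."""
    return hill_equivalent_stress(sig, H) - sigma_y

# Illustrative stress states (values are arbitrary, units e.g. MPa)
uniaxial = [100.0, 0.0, 0.0, 0.0, 0.0, 0.0]
pure_shear = [0.0, 0.0, 0.0, 50.0, 0.0, 0.0]
seq_uniaxial = hill_equivalent_stress(uniaxial)
seq_shear = hill_equivalent_stress(pure_shear)
```

Choosing coefficients other than one only rescales the individual terms, so the same function covers the anisotropic reference material once its coefficients are supplied.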

When the yield limit is reached, \(f=0\), the material no longer reverts to its original shape and size upon release of the applied stress, signaling the onset of plastic deformation. Since the focus of this study is solely on the onset of plastic yielding, we assume ideal plasticity, where the yield strength remains constant, independent of the material's deformation history. This assumption simplifies the model by disregarding work hardening effects. As suggested by Hartmaier [26], the yield function can be characterized in a data-oriented approach, utilizing an ML algorithm known as Support Vector Classification (SVC). To effectively train such an SVC model to serve as an ML yield function, it is crucial to provide a training dataset comprising stress tensors, each labeled with its corresponding state as either elastic or plastic. Once trained, the SVC algorithm can classify any given stress tensor \({\varvec{\sigma}}\) into distinct "elastic" \({f}_{ML}\left({\varvec{\sigma}}\right)= -1\) and "plastic" \({f}_{ML}\left({\varvec{\sigma}}\right)=+1\) regions, thereby creating a map of the material's behavior under different stress conditions. The primary objective of this approach is to establish an optimal hypersurface, the yield locus, which serves as the boundary separating the elastic and plastic regions. For the SVC model, the yield function is formulated as

$${f}_{ML}\left({\varvec{\sigma}}\right)=\sum_{i=1}^{{N}_{SV}}{\alpha }_{i}{y}_{i}\psi \left({\varvec{\sigma}}{,{\varvec{\sigma}}}_{i}\right)+b$$
(3)

where \({N}_{SV}\) is the number of support vectors, which are the critical data points within the training set that lie closest to the decision boundary. The dual coefficients \({\alpha }_{i}\) are determined by solving the dual optimization problem. If \({\alpha }_{i}>0\), the corresponding data point \({{\varvec{\sigma}}}_{i}\) is a support vector, actively contributing to the decision boundary. If \({\alpha }_{i}=0\), the data point \({{\varvec{\sigma}}}_{i}\) does not influence the decision boundary. \({y}_{i}\) are the labels of the training data points chosen as support vectors: \({y}_{i}\) is +1 if the data point belongs to the positive class and −1 if it belongs to the negative class. \(b\) is the bias term, adjusting the position of the decision boundary. For the nonlinear problem at hand, the radial basis function (RBF) kernel \(\psi \left({\varvec{\sigma}}{,{\varvec{\sigma}}}_{i}\right)=\exp\left(-\gamma {\Vert {\varvec{\sigma}}-{{\varvec{\sigma}}}_{i}\Vert }_{2}^{2}\right)\) is chosen, where \({\Vert .\Vert }_{2}\) denotes the Euclidean norm. The RBF kernel was chosen for its flexibility and proven efficiency in handling complex, non-linear relationships in the data, which makes it well suited to our high dimensional classification problem, see e.g. [27, 28]. The parameter \(\gamma \) determines the width of the kernel function and, consequently, the range of impact of a single training point. A larger \(\gamma \) value corresponds to a narrower kernel and thus a more localized influence, whereas a smaller \(\gamma \) value extends the influence of each training point over a wider region. To find the optimal hyperplane and construct the decision boundary, the following dual optimization problem needs to be solved.

$$\underset{\alpha }{\text{max}}\left(\sum_{i=1}^{{N}_{SV}}{\alpha }_{i} - \frac{1}{2}\sum_{i=1}^{{N}_{SV}}\sum_{j=1}^{{N}_{SV}}{\alpha }_{i}{\alpha }_{j}{y}_{i}{y}_{j}\psi \left({{\varvec{\sigma}}}_{i}{,{\varvec{\sigma}}}_{j}\right)\right)$$
(4)

Subject to constraints:

$$ \begin{gathered} \mathop \sum \limits_{i = 1}^{{N_{SV} }} \alpha_{i} y_{i} = 0 \hfill \\ 0 \le \alpha_{i} \le C \hfill \\ \end{gathered} $$
(5)

where C is the regularization parameter that penalizes misclassified data points. A smaller C value implies a less severe penalty for misclassified points, leading to the selection of a wider-margin decision function at the boundary, even though it may result in a greater number of misclassifications. Conversely, a larger C value instructs the training algorithm to restrict the number of misclassified cases by applying a large penalty, which results in a narrower margin. Grid search is a commonly used method to find the optimal hyperparameters C and \(\gamma \). It involves searching exhaustively through a specified subset of hyperparameters and selecting the combination that yields the best performance according to a pre-defined metric. Solving the dual problem yields the dual coefficients \({\alpha }_{i}\) and provides the necessary information to construct the decision boundary defined in Eq. 3 [29].

For training an ML yield function, it is essential to provide critical stresses that indicate the start of plastic deformation and use those yielding stresses to generate training data in elastic and plastic regions of the stress space. This process involves sampling points on the surface of a 6D unit sphere within the corresponding stress space. The next step involves determining a scalar multiplier for each unit stress direction by employing a root-finding method with the reference material's yield function. Solving for the zeros of the yield function identifies the multipliers, which, when applied to the unit stress tensors, result in stress states that lie precisely on the yield surface. These yield stresses serve as the basis for generating training data across the elastic and plastic domains of the stress space. Within the elastic region, the magnitude of these stresses is reduced by using a set of 25 linearly spaced multipliers, ranging from 0.1 to 0.95. This scaling guarantees that the stress states remain within the yield surface, with each state being labeled as “elastic” and assigned a numerical value of −1. In the plastic domain, the yield stresses are augmented using an array of 25 linearly spaced multipliers, ranging from 1.05 to 2, to position them in the plastic domain. These amplified stress states receive the label "plastic" and are assigned a numerical value of +1. Through this approach, a labeled data set is generated with the following structure:

$$\left\{\begin{array}{ll}{x}_{t}=\left[{\sigma }_{1},{\sigma }_{2},{\sigma }_{3},{\sigma }_{4},{\sigma }_{5},{\sigma }_{6}\right],&\quad {y}_{t}= +1\quad \text{plastic if}\; f\left({x}_{t}\right)\ge 0 \\ {x}_{t}=\left[{\sigma }_{1},{\sigma }_{2},{\sigma }_{3},{\sigma }_{4},{\sigma }_{5},{\sigma }_{6}\right],&\quad {y}_{t}= -1 \quad \text{elastic if}\; f\left({x}_{t}\right)<0\end{array}\right.$$
(6)

In this context, \({x}_{t}\) represents the feature vector in the form of a scaled stress tensor and \({y}_{t}\) indicates the label. As a result, for each unit direction, 50 stress tensors are generated (25 within the elastic region and 25 within the plastic region), each representing a labeled data point. This procedure results in a suitable dataset that captures a wide spectrum of stress states, providing the ML model with the necessary information to accurately distinguish between elastic and plastic regions of the stress space. Following this step, the prepared training data can be used for ML training. Figure 1 offers a simplified schematic representation of the method for generating and labeling training data within a two-dimensional stress space, under plane stress condition. It is important to note that this illustration depicts equivalent stresses for clarity, whereas the actual method employs the full 6D stress tensor.

Fig. 1 Illustration of training data generation and labeling. (a) Initial identification of yield stress states: Each orange dot indicates a unit stress direction within the 2-dimensional stress space. Corresponding scalars are computed for these unit tensors to locate the stress states on the yield surface, illustrated by yellow dots, defining the onset of yielding. (b) Training data generation through scaling: Red dots signify stress states scaled beyond the yield surface, representing the plastic state. Conversely, blue dots show stress states scaled within the yield surface, indicative of the elastic state. (c) Training the ML model: The application of the labeled data to train the ML model. The resulting hyperplane, learned by the model, serves as the ML yield function, effectively differentiating between the elastic and plastic regions within the stress space
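The data generation procedure described above (unit direction, root finding for the yield multiplier, down- and upscaling with 25 multipliers each) can be sketched as follows. This is a minimal illustration under stated assumptions: the yield strength `SIGMA_Y`, the isotropic stand-in for the reference yield function, the bisection root finder, and all function names are hypothetical choices for this example, not the procedure as implemented in the study.

```python
import math
import random

SIGMA_Y = 100.0  # hypothetical yield strength, chosen only for illustration

def yield_function(sig):
    """Isotropic special case of Eqs. 1-2 (all Hill coefficients equal to one)."""
    s1, s2, s3, s4, s5, s6 = sig
    inner = ((s1 - s2) ** 2 + (s2 - s3) ** 2 + (s3 - s1) ** 2
             + 6.0 * (s4 ** 2 + s5 ** 2 + s6 ** 2))
    return math.sqrt(0.5 * inner) - SIGMA_Y

def random_unit_direction(rng, dim=6):
    """Uniform components in (-1, 1), normalized onto the unit sphere."""
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def scale(direction, k):
    return [k * c for c in direction]

def yield_multiplier(direction, f, max_doublings=60, n_bisect=100):
    """Bisection for the scalar k such that f(k * direction) = 0."""
    lo, hi = 0.0, 1.0
    for _ in range(max_doublings):      # bracket the root first
        if f(scale(direction, hi)) >= 0.0:
            break
        hi *= 2.0
    else:
        raise ValueError("no yield point found along this direction")
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if f(scale(direction, mid)) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def linspace(a, b, n):
    return [a + i * (b - a) / (n - 1) for i in range(n)]

def make_training_data(directions, f, n_scale=25):
    """Return feature vectors X and labels y (-1 elastic, +1 plastic)."""
    X, y = [], []
    for d in directions:
        k = yield_multiplier(d, f)
        for m in linspace(0.1, 0.95, n_scale):   # inside the yield surface
            X.append(scale(d, m * k))
            y.append(-1)
        for m in linspace(1.05, 2.0, n_scale):   # outside the yield surface
            X.append(scale(d, m * k))
            y.append(+1)
    return X, y

rng = random.Random(0)
directions = [random_unit_direction(rng) for _ in range(4)]
X, y = make_training_data(directions, yield_function)
```

Each direction contributes 50 labeled points, so the four directions used here give 200 samples. The explicit bracketing step guards against directions that never reach the yield surface, such as a purely hydrostatic stress in the pressure-insensitive case.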

Convexity is an important characteristic for a yield function to ensure it represents material behavior accurately. This convexity ensures physically consistent behavior, preventing any unpredictable transitions or non-physical manifestations. By utilizing optimal hyperparameters, the SVC algorithm can maintain the convexity observed in the training data when forming the decision boundary. As a result, SVC is capable of naturally reflecting this crucial characteristic without requiring any additional enforcement of convexity constraints. This characteristic makes SVC a suitable ML algorithm for developing models intended to serve as yield functions, ensuring both accuracy and adherence to the necessary physical properties.

Given that the ML yield function is defined as a convolution sum over the support vectors, the gradient of the SVC decision function can be calculated as

$$\frac{\partial {f}_{ML}\left({\varvec{\sigma}}\right)}{\partial{\varvec{\sigma}}} = \sum_{i=1}^{{N}_{SV}}{-2\gamma \alpha }_{i}{y}_{i}{\text{exp}}\left(-\gamma {\Vert {\varvec{\sigma}}-{{\varvec{\sigma}}}_{i}\Vert }_{2}^{2}\right)\left({\varvec{\sigma}}-{{\varvec{\sigma}}}_{i}\right)$$
(7)

from which the plastic strain increments can be derived directly for plasticity models based on a normality rule, which are typically used within a standard finite element formulation.
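To make Eqs. 3 and 7 concrete, the following sketch evaluates the RBF decision function and its analytical gradient and compares the latter with a central finite-difference approximation. The support vectors, coefficients, bias, and \(\gamma \) below are arbitrary toy values assumed for this example, and `coef[i]` stands for the product \({\alpha }_{i}{y}_{i}\) (scikit-learn, for instance, stores this product in the `dual_coef_` attribute of a fitted `SVC`).

```python
import math

def rbf(x, xi, gamma):
    """RBF kernel exp(-gamma * ||x - xi||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

def decision_function(x, support_vectors, coef, b, gamma):
    """Eq. 3, with coef[i] = alpha_i * y_i."""
    return sum(c * rbf(x, s, gamma) for c, s in zip(coef, support_vectors)) + b

def decision_gradient(x, support_vectors, coef, gamma):
    """Eq. 7: analytical gradient of the decision function w.r.t. x."""
    grad = [0.0] * len(x)
    for c, s in zip(coef, support_vectors):
        w = -2.0 * gamma * c * rbf(x, s, gamma)
        for j in range(len(x)):
            grad[j] += w * (x[j] - s[j])
    return grad

def numerical_gradient(x, support_vectors, coef, b, gamma, h=1e-6):
    """Central finite differences, used only to check Eq. 7."""
    grad = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp = decision_function(xp, support_vectors, coef, b, gamma)
        fm = decision_function(xm, support_vectors, coef, b, gamma)
        grad.append((fp - fm) / (2.0 * h))
    return grad

# Toy values (assumptions for illustration, not fitted to any material)
support_vectors = [[0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, -0.3, 0.2, 0.0, 0.0, 0.0],
                   [0.1, 0.1, 0.1, 0.4, 0.0, -0.2]]
coef = [0.8, -1.2, 0.4]
b, gamma = 0.1, 0.7
x = [0.2, -0.1, 0.3, 0.05, -0.25, 0.15]

analytic = decision_gradient(x, support_vectors, coef, gamma)
numeric = numerical_gradient(x, support_vectors, coef, b, gamma)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The bias term drops out of the gradient, which is why `decision_gradient` does not take `b` as an argument.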

2.2 Active learning principles

Unlike most conventional learning methods, where the training data remains static, active learning involves an iterative, targeted selection of training examples based on the current state of the learning algorithm. In this paradigm, the learner (algorithm) is no longer a passive recipient of data but rather an active participant in its collection [30].

In the context of ML, the concept of version space plays a crucial role. As introduced by Mitchell [31], it refers to the set of all hypotheses or models that can explain the data seen so far, as shown in Fig. 2. As more data or training examples are observed, hypotheses that are inconsistent with the new data are removed from the version space. The remaining hypotheses after all the data has been observed are those that are consistent with all the training examples. This is where the active learning approach exhibits its strength. By strategically requesting the most informative examples, the learning algorithm can effectively navigate the version space, aiming to find the hypothesis or model that not only fits the training data but also generalizes well to unseen data. An active learning strategy such as Query-By-Committee aims to efficiently constrain this version space, enabling a more precise search [30].

Fig. 2 Version space example for a linear classifier. All hypotheses are consistent with the labeled training data but each represents a different model in the version space [30]

The QBC method, first introduced by Seung et al. [22], offers an approach to active learning that seeks to minimize the inherent prediction uncertainty of an ensemble of models. The main concept of the QBC method is having a committee of models, of which each is trained on currently available data. In the context of classification, each committee member is allowed to vote on the categorization or label of a new data instance. The novel data instance that elicits the most disagreement or conflicting votes among the committee, quantified by a voting entropy measure or similar metric, is chosen for labeling. This process is conducted iteratively, with each new labeled instance being added to the data set, serving to refine and improve the models within the committee. The guiding principle of this method is the optimization of anticipated information-gain from querying a new instance. It operates under the assumption that the instances causing the most disagreement within the committee are likely to provide the most valuable learning insights. In active learning, obtaining new data instances within the design or feature space can be achieved by using a variety of methods. Fundamentally, there are three different data selection strategies: (i) Pool-based sampling, where models evaluate a provided pool of unlabeled data. (ii) Stream-based selective sampling, in which models evaluate data instances as they appear in a data stream. (iii) Membership query synthesis, where models can freely evaluate locations within a specified feature space, thereby independently creating new data instances and querying their label. Each of these strategies focuses on efficiently navigating the feature space, identifying, and labeling the instances that cause the most disagreement among the committee of models [30].

When a new data instance is procured, the crucial question of how to measure the level of disagreement among the models in the committee arises. Cohn et al. [32, 33] suggested using variance as a metric to quantify this disagreement. By generating queries that minimize this variance, the potential for future prediction errors can be reduced. As suggested by Krogh et al. [34], variance can be defined as:

$${s}^{2}\left(x\right)=\frac{1}{N}\left(\sum_{\eta =1}^{N}{\left({f}_{\eta }\left(x\right)-\overline{f }\left(x\right)\right)}^{2}\right)$$
(8)

In this equation N denotes the number of committee members, \({f}_{\eta }\left(x\right)\) is the prediction of the ηth model and \(\overline{f }\left(x\right)\) is the mean over all predictions at location x. Based on this formulation the next location to query is determined by solving an optimization problem and finding the new data point in the feature space, i.e., stress space, at which the variance among the committee members is maximized as

$${x}^{*}=\underset{x}{{\text{arg max}}}\left({s}^{2}\left(x\right)\right)$$
(9)
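Eqs. 8 and 9 can be illustrated with a toy committee. In the sketch below, the committee members are simple one-dimensional threshold classifiers and the arg max is taken over a short list of candidate points; both are assumptions made purely for illustration, since the actual committee consists of SVC models and the maximization is carried out over the continuous stress space.

```python
def committee_variance(predictions):
    """Eq. 8: population variance of the committee predictions at one point."""
    n = len(predictions)
    mean = sum(predictions) / n
    return sum((p - mean) ** 2 for p in predictions) / n

def most_disputed(candidates, committee):
    """Eq. 9 restricted to a finite candidate list."""
    return max(candidates,
               key=lambda x: committee_variance([member(x) for member in committee]))

def make_threshold_classifier(threshold):
    """Hypothetical committee member: +1 above the threshold, -1 below."""
    def classify(x):
        return 1 if x > threshold else -1
    return classify

committee = [make_threshold_classifier(t) for t in (0.3, 0.5, 0.7)]
candidates = [0.1, 0.4, 0.9]
query = most_disputed(candidates, committee)
```

All members agree at 0.1 and 0.9, so the variance vanishes there, while the votes at 0.4 are split and that point is selected as the next query.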

In this approach we generate training data from unit stresses in a 6D stress space encompassing both normal and shear components as detailed in Sect. 2.1. Each unit stress is then systematically escalated until reaching the zero of the yield function indicating the onset of plastic yielding. This approach produces a comprehensive collection of 6D stress tensors right at the threshold of plastic yielding. Given this methodology, it becomes essential to sample points on the surface of a unit sphere in the 6D stress space. Some prior studies, such as the work by Wessel et al. [24], have sought to ensure sampling on a unit sphere using a soft constraint. However, this method is not efficient due to the significant computational resources required to probe infeasible areas of the search space, specifically points outside the unit sphere. To overcome this problem, in this study a novel method is proposed based on transforming the problem so that potential solutions naturally lie on the surface of the 6D unit sphere. This transformation is achieved by shifting from Cartesian to spherical coordinates. In spherical coordinates, one parameter represents the radius (r), and five others correspond to angles \(\left({\theta }_{1},\dots ,{\theta }_{5}\right)\). For a unit sphere, the radius is invariably fixed at 1, meaning only the five angles need to be optimized. Consequently, any set of potential angle values corresponds to a point on the sphere, eliminating the requirement for a penalty term to enforce adherence to the constraint. In this approach, the task of optimization focuses on finding the values of the angles that maximize the objective function. By incorporating the constraint directly into the problem formulation, this method enables a more efficient and reliable optimization. In this case, the objective function can be described as:

$${\theta}^{*}={\text{arg}}\,\mathop{\max}\limits_{\theta}\left({s}^{2}\left(\theta \right)\right)$$
(10)

To solve this optimization problem, we use the Differential Evolution (DE) algorithm proposed by Storn and Price [35]. This algorithm is typically utilized for the minimization of functions in continuous space, so the maximization in Eq. 10 can equivalently be treated as the minimization of the negative variance. As a population-based method, DE is designed to search for global rather than local optima. In the DE algorithm, an initial population of candidate solutions is generated in the bounded search space. Each solution vector (a so-called individual) in the population represents a potential solution to the optimization problem. During the mutation phase, the algorithm creates a mutant vector for each individual by combining the vectors of three other distinct individuals from the current population. The resultant mutant vector is a potential candidate for the next generation. Following mutation, a crossover operation is performed to create a trial vector by mixing the mutant and target vectors. The extent of this mixing is governed by a crossover rate parameter. In the selection phase, a greedy strategy is applied where the trial vector competes with the original individual. The one that provides better fitness according to the objective function is selected to proceed to the next iteration (so-called generation). This process of mutation, crossover, and selection is repeated for a specified number of iterations or until a stopping criterion is met. As a result, the DE algorithm gradually evolves the population towards the optimal solution. After sampling in spherical coordinates, the transformation to Cartesian coordinates on the unit sphere is given by:

$$ \begin{aligned} x_{1} = & \cos \left( {\theta_{1} } \right) \\ x_{2} = & \sin \left( {\theta_{1} } \right)\cos \left( {\theta_{2} } \right) \\ x_{3} = & \sin \left( {\theta_{1} } \right)\sin \left( {\theta_{2} } \right)\cos \left( {\theta_{3} } \right) \\ x_{4} = & \sin \left( {\theta_{1} } \right)\sin \left( {\theta_{2} } \right)\sin \left( {\theta_{3} } \right) \cos \left( {\theta_{4} } \right) \\ x_{5} = & \sin \left( {\theta_{1} } \right)\sin \left( {\theta_{2} } \right)\sin \left( {\theta_{3} } \right) \sin \left( {\theta_{4} } \right) \cos \left( {\theta_{5} } \right) \\ x_{6} = & \sin \left( {\theta_{1} } \right)\sin \left( {\theta_{2} } \right)\sin \left( {\theta_{3} } \right) \sin \left( {\theta_{4} } \right) \sin \left( {\theta_{5} } \right) \\ \end{aligned} $$
(11)

where \({\theta }_{1},{\theta }_{2},{\theta }_{3},{\theta }_{4}\) are in the range \(\left[0,\pi \right]\), and \({\theta }_{5}\) is in the range \(\left[0,2\pi \right]\).
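The angle parametrization of Eq. 11 and the mutation, crossover, and selection steps of DE can be sketched together as follows. This is a minimal DE variant written for this illustration (maximizing directly instead of minimizing the negative objective); the population size, mutation factor `F`, crossover rate `CR`, and the stand-in objective are assumptions, and in practice the committee variance of Eq. 10 would take the place of the objective.

```python
import math
import random

def angles_to_unit_vector(theta):
    """Eq. 11: five angles -> point on the 6D unit sphere."""
    x, s = [], 1.0
    for t in theta:
        x.append(s * math.cos(t))
        s *= math.sin(t)          # running product of sines
    x.append(s)
    return x

def differential_evolution_max(objective, bounds, pop_size=20,
                               generations=100, F=0.7, CR=0.9, seed=0):
    """Minimal DE sketch that maximizes `objective` within box bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [objective(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)   # at least one mutated component
            trial = list(pop[i])
            for j in range(dim):
                # Crossover: mix mutant and target component-wise
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    trial[j] = min(max(v, lo), hi)   # clip to the bounds
            # Selection: greedy, keep the better of trial and target
            f_trial = objective(trial)
            if f_trial >= fit[i]:
                pop[i], fit[i] = trial, f_trial
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

# Angle bounds as stated for Eq. 11
bounds = [(0.0, math.pi)] * 4 + [(0.0, 2.0 * math.pi)]

def stand_in_objective(theta):
    """Hypothetical objective: fourth Cartesian component of the unit vector."""
    return angles_to_unit_vector(theta)[3]

best_theta, best_val = differential_evolution_max(stand_in_objective, bounds)
best_x = angles_to_unit_vector(best_theta)
```

Because every angle vector maps onto the sphere by construction, no penalty term is needed and each objective evaluation is spent on a feasible point.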

In this work, we apply the methodology proposed by Raychaudhuri et al. [36], which involves training models on a random subset of the initial training dataset. This approach is particularly vital when working with SVC models. These models, when trained with identical hyperparameters on the same dataset, yield identical results, which contradicts the requirements for the QBC approach. The QBC approach relies on variation among the committee members' predictions. To introduce this variation, each committee member is assigned a random subset of the dataset in each active learning iteration. A comprehensive grid search is performed in each iteration for each of these models in the committee to find the optimal hyperparameters.
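The subset assignment can be sketched as below. The function name is hypothetical; the committee size of five and the 80% fraction are the values stated later in Sect. 2.3, used here as defaults. Each committee member receives its own index set, drawn without replacement, on which it would then be trained.

```python
import random

def committee_subsets(n_samples, n_members=5, fraction=0.8, seed=None):
    """Draw one random index subset per committee member (no replacement)."""
    rng = random.Random(seed)
    k = max(1, int(round(fraction * n_samples)))
    return [sorted(rng.sample(range(n_samples), k)) for _ in range(n_members)]

# Example: 100 labeled points, five members, 80 points each
subsets = committee_subsets(100, seed=0)
```

Redrawing the subsets in every active learning iteration keeps the committee members from collapsing onto the same decision boundary.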

2.3 Active learning for machine learning yield function

Following the work of Shoghi and Hartmaier [17] for training an ML yield function using SVC, in our study, we considered two reference materials that exhibit different levels of complexity: (i) an isotropic material, characterized by Hill coefficients of one, and (ii) an anisotropic material, defined by a varying range of Hill coefficients. The exploration of anisotropic material behavior using this methodology provides the possibility to test the model's ability to generalize across diverse material behaviors, thereby evaluating its reliability and applicability in broader contexts. Both materials are defined using the open-source package pyLabFEA [37] which introduces a simple version of finite element analysis for solid mechanics and elastic–plastic materials, fully written in Python. In accordance with the study conducted by Shoghi and Hartmaier [17] the parameters employed for defining both materials are summarized in Table 1.

Table 1 Parameters defined for the isotropic and anisotropic reference material

The described reference materials are used to create the training data for the ML yield function. First, unit stresses with normal and shear components are created in the 6D stress space. As mentioned in Sect. 2.1, each unit stress is then increased proportionally until the yield function of the stress tensor reaches zero. At this point, plastic yielding begins for the specific load case. The collection of full 6D stress tensors at the onset of plastic yielding represents the yield function in a data-oriented manner and forms the ground truth for training the ML yield function. Using these yield stress tensors as a basis, the training data across both elastic and plastic domains, labeled −1 and +1, respectively, is generated by downscaling and upscaling, as described in Sect. 2.1. The significance of this 6D stress space representation lies in its ability to capture the complexity of material behavior under various stress conditions, paving the way for more realistic and detailed yield function descriptions. It should be noted that no hardening is considered in this work and that the goal is to represent the initial yield locus.

To initialize the active learning procedure, an initial training set of stresses is generated using the random function available in the NumPy package [38], which generates a random array with 6 components from a uniform distribution within the (−1, +1) range. By normalizing, we make sure that every generated data point is located on the surface of a unit 6D sphere. Following this, a committee of five Support Vector Classifiers is trained, each receiving a random subset of the training data that constitutes 80% of the total dataset. The DE algorithm is then applied, directed by a sampling scheme aimed at maximizing the variance in the committee's predictions. The optimal solution is used to generate the next unit stress tensor which, as detailed in Sect. 2.1, is scaled, labeled, and then added to the existing dataset. New unit tensors are thus sampled in areas where the committee's predictions disagree the most, which facilitates training the models on more complex and less-explored areas of the problem space and enhances the capabilities of the models.

After each new unit tensor is sampled and the training data set is updated, the committee is retrained, and the DE is performed again with the updated dataset. This iterative process continues until a termination condition, such as reaching a predetermined number of iterations, is met. Throughout this process, visual inspections of the trained yield functions and support vectors are conducted to assess the quality of the models. Additionally, the variance in the committee's predictions across iterations is monitored and graphed, providing insight into the effectiveness of the active learning procedure. The iterative procedure of the QBC strategy used for defining an SVC model is shown in Fig. 3. The algorithm for the active learning process is given in Table 2.
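The initialization step can be sketched as follows. The function name `random_unit_stresses` is hypothetical and the use of NumPy's `default_rng` generator is an assumption; the text only states that a uniform distribution within (−1, +1) and a subsequent normalization are used.

```python
import numpy as np


def random_unit_stresses(n, seed=None):
    """Draw n random 6D stress directions and project them onto the unit sphere."""
    rng = np.random.default_rng(seed)
    # Six components per stress, uniform in (-1, +1)
    stresses = rng.uniform(-1.0, 1.0, size=(n, 6))
    # Normalize each row so that every point lies on the unit 6D sphere
    return stresses / np.linalg.norm(stresses, axis=1, keepdims=True)


initial_directions = random_unit_stresses(100, seed=0)
```

Note that normalizing uniformly distributed components does not give a perfectly uniform distribution on the sphere; this matters little for an initial set that active learning subsequently refines.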

Fig. 3

This plot shows the Query-By-Committee (QBC) active learning process. The cycle starts with the initial training Data Set, split into Data Subsets. Each subset trains a Support Vector Classification (SVC) model. Next, optimization seeks a new data point (represented by a small orange box) at the position in data space where the highest uncertainty in the SVC models representing the committee is observed. After labelling this new data point with respect to the reference material (small green box), it is added to the training Data Set. This iterative cycle then repeats until the stopping criterion is fulfilled

Table 2 Algorithm for QBC as is used in the present paper

2.4 Stop criteria

2.4.1 Variance and rate of performance improvement

In active learning, as opposed to its static counterpart, well-defined stopping criteria are necessary to ensure efficient use of resources and an efficient model. Margatina et al. [39] identified this challenge as the determination of the 'optimum' stopping point beyond which the model's learning is considered sufficient. According to their discussion, 'optimal' is largely domain-dependent and requires a careful balance between accuracy and cost. Bloodgood [40] investigated the development of efficient and adaptable stopping methods. Based on this work, it was suggested to explore user-adjustable stopping criteria that account for different annotation/performance tradeoff valuations, which could offer users the flexibility to choose a stopping criterion that aligns best with their specific requirements. Assigning a fixed number of iterations in active learning is impractical because of two intrinsic characteristics of the process. Firstly, the training often commences with random data points of varying sizes, introducing a degree of unpredictability. Secondly, the algorithm employed for optimization in active learning typically exhibits a stochastic nature. Therefore, the criteria must adapt based on the current performance of the model, leading us to the question: when should the QBC strategy stop its search for new data points for our specific use case?

Variance, as a reflection of the uncertainty in the model predictions, is one of the measures that is typically considered. However, relying solely on variance as a stopping criterion presents a challenge: it neglects the critical aspect of data exploration and can be overly sensitive to minor fluctuations. Moreover, setting an a priori level of variance can be difficult for a user. To mitigate these issues, we propose a more dynamic approach. Instead of requiring the variance to reach a fixed minimum value, we suggest a dynamic threshold. This threshold is defined by the user as a desired percentage reduction from the maximum variance observed during the initial n iterations (in this study, we use n = 10), which can adapt based on the specific nature of the problem. This approach aligns with the inherent dynamics of the learning process: as learning progresses, the disagreement among committee members is expected to reduce, which leads to a decrease in the prediction variance of the committee. In addition, the rate of change in variance can be monitored as a second stopping criterion. This provides a robust way to handle fluctuations and to identify when the model has reached a point of stability, similar to early stopping in neural network training [41]. By observing the change in variance from one iteration to the next, we can determine the rate of uncertainty reduction. When this rate falls below a specified threshold, indicating a minimal decrease in variance, it is reasonable to stop the active learning process. By considering both the magnitude of the variance and its rate of change, it is possible to formulate a robust stopping criterion that follows the general decreasing trend in variance without being overly influenced by minor fluctuations.
The proposed dual stopping criterion is applicable not only to QBC learning processes but also to other approaches like Gaussian processes.

For the practical usage of this dual stopping criterion: firstly, the variance that quantifies the committee disagreement as described in Eq. 8 must fall below a predefined value, \({{\varepsilon }_{{\text{crit}}}}_{1}\) which can be chosen based on a desired percentage reduction from the maximum variance value observed during the first \(n\) iterations. For \(\alpha \) being the desired percentage reduction, \({{\varepsilon }_{{\text{crit}}}}_{1}\) can be defined as:

$$ \begin{gathered} \varepsilon_{{{\text{crit1}}}} = {\text{max}}\left( {s^{2} \left( x \right)_{0 } , s^{2} \left( x \right)_{1} , \ldots , s^{2} \left( x \right)_{n } } \right) \left( {1 - \left( {\alpha / 100} \right)} \right) \hfill \\ s^{2} \left( x \right)_{i} < \varepsilon_{{{\text{crit1}}}} ,\quad 0 < \alpha < 100\% \hfill \\ \end{gathered} $$
(12)

This allows the user to adapt the stopping criteria according to their specific needs and the characteristics of the data being used.

Secondly, the rate of change of the variance over a sequence of iterations must fall below a critical threshold \({{\varepsilon }_{{\text{crit}}}}_{2}\), which indicates that the committee disagreement is no longer decreasing significantly. This rate is computed as the difference between the maximum and minimum variances within the sequence, divided by the sequence length \(\Delta n\) (the number of iterations per sequence). Essentially, this measures the slope of the variance within the sequence as:

$$ \begin{aligned} &\frac{{\Delta s^{2} }}{\Delta n} = \frac{{\left( {\max \left( {s^{2} \left( x \right)_{i - n } , s^{2} \left( x \right)_{i - n + 1} , \ldots , s^{2} \left( x \right)_{i } } \right) - \min \left( {s^{2} \left( x \right)_{i - n } , s^{2} \left( x \right)_{i - n + 1} , \ldots , s^{2} \left( x \right)_{i } } \right)} \right)}}{\Delta n} \hfill \\ &\left| {\frac{{\Delta s^{2} }}{\Delta n}} \right| < \varepsilon_{{{\text{crit2}}}} \hfill \\ \end{aligned} $$
(13)

The learning process stops only when both conditions are met and can be summarized as:

$${\text{if }} {s}^{2}{\left(x\right)}_{i}< {{\varepsilon }_{{\text{crit}}}}_{1}\quad {\text{and}}\quad \left|\frac{{\Delta s}^{2}}{\Delta n}\right|< {{\varepsilon }_{{\text{crit}}}}_{2}:{\text{stop}}$$
(14)
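A compact way to read Eqs. 12 to 14 together is the following sketch. The function names are hypothetical, and the default values (`n_init = 10`, `alpha = 99.98`, `window = 10`, `eps_crit2 = 1e-3`) are the settings reported later in this paper for the reference materials; the synthetic, geometrically decaying variance history is purely illustrative.

```python
def variance_threshold(variances, n_init=10, alpha=99.98):
    """Eq. 12: percentage reduction from the maximum of the initial variances."""
    if not 0.0 < alpha < 100.0:
        raise ValueError("alpha must lie strictly between 0 and 100")
    # Indices 0..n_init, matching the argument list of the max in Eq. 12
    return max(variances[: n_init + 1]) * (1.0 - alpha / 100.0)


def variance_rate(variances, i, window=10):
    """Eq. 13: (max - min) of the trailing window, divided by the window length."""
    segment = variances[i - window : i + 1]
    return (max(segment) - min(segment)) / window


def should_stop(variances, i, eps_crit2=1e-3, n_init=10, alpha=99.98, window=10):
    """Eq. 14: stop only when both criteria hold at iteration i."""
    if i < max(n_init, window):
        return False  # not enough history to evaluate either criterion
    eps_crit1 = variance_threshold(variances, n_init, alpha)
    rate = variance_rate(variances, i, window)
    return variances[i] < eps_crit1 and abs(rate) < eps_crit2


# Synthetic example (assumption): variance decays by 20% per iteration from 50.
variances = [50.0 * 0.8 ** k for k in range(120)]
stop_at = next(i for i in range(len(variances)) if should_stop(variances, i))
```

Because the threshold is derived from the observed initial variances, the same code adapts to problems with very different absolute variance levels; only `alpha` and `eps_crit2` remain user choices.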

2.4.2 Validation of the stopping criteria

Monitoring the variance and its rate of change can provide insights into the reduction of the committee disagreement and acts as a potential stopping criterion during the learning process. However, to validate the model's predictive performance and generalization ability, a testing process is crucial. This assessment aids in verifying the reliability of the variance-based stopping criterion. While initial testing validates the criterion's effectiveness, users can subsequently use the variance-based approach independently to determine when to halt the learning process. Such an evaluation process involves examining the model's actual predictive performance and its capability to generalize learning to unseen data. During testing, the definition of unbiased test cases is of paramount importance. The model's performance can be assessed using a proper score after each active learning iteration. A higher rate of improvement is typically expected at the start of the learning process, which is likely to slow down over time.

Here, we use the Confusion Matrix, which is a powerful tool often utilized in supervised ML to evaluate the performance of classification models. It provides a comprehensive overview of the prediction results compared to the actual classifications, neatly presenting both correct predictions and the types of errors made. The structure of a typical confusion matrix is shown in Fig. 4.

Fig. 4

Confusion matrix for testing the performance of the trained classification model

The confusion matrix consists of four main components: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). True Positives are the correctly identified positive cases, and True Negatives are the correctly identified negative cases. On the other hand, False Positives are negative cases that were incorrectly identified as positive, and False Negatives are positive cases incorrectly identified as negative. These components provide valuable insights into the model's performance. For instance, a high number of True Positives and True Negatives indicates a model’s good predictive power. In contrast, a high number of False Positives and False Negatives indicates that the model is struggling to make accurate predictions. Based on the confusion matrix, different metrics such as accuracy, precision, recall, and F1 score can be calculated to evaluate the performance of the trained classification model. The Matthews Correlation Coefficient (MCC) is regarded as a superior metric for evaluating binary classifications in ML, particularly in situations involving imbalanced data sets. This is because MCC takes into account all values in the confusion matrix, rather than concentrating on a single dimension. The MCC is a correlation coefficient between the observed and the predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction, and −1 an inverse prediction. This makes it a balanced measure, even when the classes are of very different sizes. MCC is generally regarded as a balanced metric because it considers both over-predictions and under-predictions. For instance, a model with a high number of False Positives and False Negatives will be penalized in the MCC score [42]. The MCC score is calculated as

$$MCC=\frac{(TP)(TN)-(FP)(FN)}{\sqrt{\left(TP+FP\right) \left(TP+FN\right) \left(TN+FP\right) \left(TN+FN\right)}}$$
(15)
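Eq. 15 translates directly into code. The sketch below uses a hypothetical function name, and returning 0 when the denominator vanishes is a common convention adopted here as an assumption; the paper does not specify how this case is handled.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts (Eq. 15)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        # Assumed convention: an empty row or column gives an uninformative score
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Balanced test set of 600 points, as used in this work
perfect = mcc_from_counts(tp=300, tn=300, fp=0, fn=0)
inverse = mcc_from_counts(tp=0, tn=0, fp=300, fn=300)
chance = mcc_from_counts(tp=150, tn=150, fp=150, fn=150)
```

The three calls reproduce the limiting cases named in the text: perfect prediction, inverse prediction, and a prediction no better than random.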

However, an effective application of the confusion matrix hinges on the careful definition of test cases. Creating these cases can pose a challenge, particularly in complex or nuanced domains. It is essential that the test cases accurately capture the diversity and distribution of the data the model is expected to encounter in real-world scenarios. Additionally, a balance between positive and negative cases is crucial to avoid a bias in the model's performance assessment. Poorly defined or biased test cases can lead to a skewed confusion matrix, compromising the reliability of the derived performance metrics. Therefore, establishing robust, representative test cases is critical to fully harness the evaluative power of the confusion matrix and to ensure a fair and accurate assessment of the model's ability to generalize its predictions. To facilitate a comprehensive evaluation, 5 test sets have been created, each including 600 test points situated closely to the decision boundary (hyperplane) as shown in Fig. 5. 300 of the test cases are in the elastic area and 300 in the plastic area, thus forming a balanced test set. These test cases will be used in all the evaluations of this work.

Fig. 5

Extreme test cases in close vicinity of the decision boundary

For the test cases, a similar approach is taken to define randomly distributed unit vectors in the 6D stress space. Using the yield function, the scale in each direction to the ideal decision boundary (the critical point where a material undergoes a transition from elastic to plastic deformation) can be determined. To generate test data points that lie close to this decision boundary, the calculated scale is multiplied by factors of 0.99 and 1.01, respectively, which results in two additional sets of points—one just below the yield point (representing elastic states) and another just above the yield point (representing plastic states). The sets of test points form the near boundary test lines as depicted in Fig. 5. The proposed near-boundary test lines lie within a critical zone or margin that extends from 0.95 to 1.05, surrounding the hyperplane located at 1.0 approximated by the SVC model. The test line at 1.01 resides just above the decision boundary, falling within the upper part of this critical margin. Conversely, the test line at 0.99 is just below the boundary, occupying the lower part of the margin. These test cases located near the decision boundary often represent challenging instances for the model to correctly classify. Yet, these test cases offer invaluable insights into the model's capacity to generalize from the training data. If the model can successfully classify these instances, this strongly indicates high model performance and robustness. These near-boundary test lines are not only positioned strategically close to the decision boundary but are also randomly generated, thus introducing a valuable element of variability to the testing process, ensuring that the model's performance is evaluated across a diverse range of instances.
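The construction of the near-boundary test points can be sketched for the isotropic case, where the Hill criterion with all coefficients equal to one reduces to the \({J}_{2}\) equivalent stress. Several details are assumptions made for illustration: the function names, the Voigt ordering of the six stress components, and the placeholder yield strength of 1.0, which stands in for the value listed in Table 1.

```python
import numpy as np


def j2_equivalent_stress(sig):
    """J2 equivalent stress for stresses in the assumed Voigt order
    (s11, s22, s33, s23, s13, s12)."""
    sig = np.atleast_2d(sig)
    s11, s22, s33, s23, s13, s12 = sig.T
    normal = 0.5 * ((s11 - s22) ** 2 + (s22 - s33) ** 2 + (s33 - s11) ** 2)
    shear = 3.0 * (s23 ** 2 + s13 ** 2 + s12 ** 2)
    return np.sqrt(normal + shear)


def near_boundary_test_set(n_per_class=300, yield_strength=1.0,
                           factors=(0.99, 1.01), seed=None):
    """Random load directions scaled to just below and just above yield."""
    rng = np.random.default_rng(seed)
    unit = rng.uniform(-1.0, 1.0, size=(n_per_class, 6))
    unit /= np.linalg.norm(unit, axis=1, keepdims=True)
    # Scale factor that brings each unit stress exactly onto the yield locus.
    # A purely hydrostatic direction would give a zero denominator, but random
    # directions are almost surely not purely hydrostatic.
    scale = yield_strength / j2_equivalent_stress(unit)
    elastic = unit * (factors[0] * scale)[:, None]  # just inside: label -1
    plastic = unit * (factors[1] * scale)[:, None]  # just outside: label +1
    X = np.vstack([elastic, plastic])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y


X_test, y_test = near_boundary_test_set(seed=0)
```

Since both classes are built from the same load directions, the test set is balanced by construction, and a classifier can only score well if its decision boundary lies within 1% of the reference yield locus along each direction.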

In evaluating the model's performance across iterations, the emphasis is not placed on the exact values. Rather, the focus is on the observable trends in these metrics over time. This approach is based on the understanding that inherent randomness in the learning process can cause fluctuations in the exact values, which may not necessarily reflect the overall learning trajectory of the model. By comparing trends instead of exact values, a more meaningful assessment of the model's learning progress and generalization ability can be achieved.

3 Results and discussion

3.1 Active learning results: the most informative points

3.1.1 Isotropic material

The training process, as detailed earlier, commenced with an initial training set of 100 random unit stresses in a 6D stress space. Figure 6 illustrates the results of the training after 200 iterations. This figure plots the \({J}_{2}\) equivalent stress of the stress tensors at the onset of plastic yielding against the polar angle of the stress tensor in the π-plane, which represents the deviatoric plane in the space of principal stresses. The yield function values are represented by two color schemes: orange and purple. Orange colors signify positive yield function values, indicative of plastic yielding, while purple colors denote negative values of the yield function, indicating that the stress is within the elastic regime. It is important to note that no color scale is provided since the absolute value of the ML yield function has no physical significance. The blue and black lines in the figure mark the stress points where the respective yield function is zero. The blue line corresponds to the analytically formulated Hill-like yield criterion of the reference material, while the black line corresponds to the trained ML yield function. The open symbols in the figure denote the support vectors identified during the training procedure.

Fig. 6

Plot of the ML yield function for an isotropic material (a) after initial training using 100 data points, and (b) after 200 iterations. The yield function defined for a full 6D stress tensor is represented in cylindrical coordinates on the \(\pi \)-plane, representing the plane of deviatoric stresses in the principal stress space

As can be seen, a good agreement between the trained ML function (black line) and the Hill yield locus (blue line) can be observed after the final iteration. In Fig. 7, the evolution of variance and the MCC score over the course of iterations in the learning process is shown.

Fig. 7

The change in variance (on a logarithmic scale) of the committee predictions (blue) and the MCC test score (red) of the trained model over 200 iterations of the active learning loop for an isotropic material. A line corresponding to the dynamic threshold \({{\varepsilon }_{{\text{crit}}}}_{1}\), reflecting a 99.98% reduction from the maximum initial variance, is superimposed to visually represent the first stopping criterion

Each data point, represented by a blue dot, corresponds to the variance in one iteration. The variance values are displayed on a logarithmic scale to resolve changes at low variance values. A generally diminishing trend in variance is observed as the iterations increase, signifying a decrease in the prediction uncertainties as training progresses. To smooth out short-term fluctuations and highlight longer-term trends, the moving average of the variance over a window of 10 iterations has been computed. It is represented by a solid blue line, delineating the reduction in the logarithmic variance throughout the active learning process.
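A moving average of this kind can be computed as in the sketch below. Whether the average in the figure is trailing or centered, and whether it is taken over the raw or the logarithmic values, is not stated; the unweighted window over raw values used here is an assumption, as is the function name.

```python
import numpy as np


def moving_average(values, window=10):
    """Unweighted moving average; returns len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits
    return np.convolve(values, kernel, mode="valid")


smoothed = moving_average(range(1, 21), window=10)
```

Each output point averages 10 consecutive iterations, so isolated spikes in the per-iteration variance or MCC are damped while the overall trend is preserved.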

The line corresponding to \({{\varepsilon }_{{\text{crit}}}}_{1}\), which represents the dynamic threshold for committee disagreement, is superimposed on the plot. This threshold, set at \(10^{-2}\), reflects a 99.98% reduction from the maximum variance of 50, observed during the initial 10 iterations. This line is depicted in the figure to visually represent the first stopping criterion based on the defined threshold.
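As a quick consistency check of these numbers, inserting the reported maximum initial variance of 50 and \(\alpha = 99.98\) into Eq. 12 gives:

$$ \varepsilon_{{\text{crit1}}} = 50\left( 1 - \frac{99.98}{100} \right) = 50 \times 2 \times 10^{-4} = 10^{-2} $$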

The MCC is demonstrated in red. The moving average over 10 iterations has been calculated for MCC, indicating an overall positive trend in the predictive power of the model, which is represented by a solid red line, demonstrating the gradual improvement in binary classifications of the model over time. The x-axis labels are positioned at multiples of 10, enabling a clear distinction of the corresponding variance and MCC for every tenth iteration. From the graph, it is evident that as the number of iterations increased, the variance decreased while the MCC improved, validating the effectiveness of the active learning process in our model. It should be noted that reaching an MCC of 1 is not always possible or even expected in many real-world situations. Also, this might not be a consistent outcome, even under the same conditions.

Due to the inherent randomness in the learning process, particular attention should be paid to the moving averages rather than the exact values of the MCC scores. Owing to the stochastic nature of the initial training set selection, if the learning process were repeated, both the MCC score values and the rate at which the optimal score is reached could vary. The moving averages therefore offer a more reliable estimate of the model's learning trajectory and its ability to generalize and make accurate predictions over time, even amid the variability in the learning outcomes caused by this randomness. This approach is vital for ensuring the reliable evaluation and interpretation of the model's performance throughout the active learning iterations. When the moving averages level off, the model's learning progression has either plateaued or drastically slowed. Further training under these conditions often leads to minimal performance improvements, making it inefficient.

To establish a stopping criterion, the rate of change of variance was calculated. This involved an analysis of variances over blocks of iterations, where each block comprised 10 iterations. The minimum and maximum variance within each block were identified and used to calculate the slope representing the rate of change. This analysis is graphically represented in Fig. 8. The iteration number is plotted on the x-axis, and the calculated rate of change in variance is plotted on the y-axis. It can be assumed that the larger the slope, the more significant the improvement in the model over these iterations. Additionally, a line denoting the critical threshold \({{\varepsilon }_{{\text{crit}}}}_{2}\) of 0.001 is depicted. This critical slope line indicates when the rate of change in variance has reached the desired minimum level as per our second stopping criterion.

Fig. 8

Assessment of the variance rate change used to determine a stopping criterion for iterative processes

In this case, after approximately 95 iterations, the rate of change of variance had reached the predefined critical value, and the variance itself had also fallen within the predefined acceptable range as shown in Fig. 7. This satisfied both our first and second stopping criteria. Thus, extending the training beyond this point would have likely yielded only marginal improvements, reinforcing the efficiency of our proposed stopping criteria.

3.1.2 Anisotropic material

To further validate the versatility and general applicability of the proposed active learning method, it was applied to an anisotropic material. This choice of material introduces additional complexity due to the direction-dependent properties, thus presenting a more challenging scenario for the method under consideration. The results shown in Fig. 9 display a good agreement between the ML-trained yield function (represented by the black line) and the anisotropic Hill yield locus (denoted by the blue line) after the final iteration.

Fig. 9

Plot of the ML yield function for a plastically anisotropic material (a) after initial training using 100 data points, and (b) after 200 iterations. The yield function defined for a full 6D stress tensor is represented in cylindrical coordinates on the π-plane, representing the plane of deviatoric stresses in the principal stress space

In Fig. 10, the learning process in an anisotropic case is displayed. As in Fig. 7, variance is represented in blue, while MCC is shown in red. Variance, displayed on a logarithmic scale, is denoted by blue dots, each representing an active learning iteration. A general declining trend in variance is observed as the number of iterations increases, which signifies a reduction in the committee's disagreement throughout the learning process. This decreasing trend is represented by the solid blue line, computed as the moving average over a window of 10 iterations. A line representing the dynamic threshold \({{\varepsilon }_{{\text{crit}}}}_{1}\) is superimposed on the plot. This threshold, set at \(10^{-2}\), reflects a user-defined reduction of 99.98% from the maximum initial variance, providing the acceptable level of variance according to our first stopping criterion. Red dots are used to denote the MCC values. The moving average over a span of 10 iterations is calculated for the MCC, illustrating a generally positive trend, reflecting a gradual improvement in the model's binary classification capability over the course of the learning process.

Fig. 10

Change in variance of the committee predictions and the MCC test score of the trained model over 200 iterations of the active learning loop for an anisotropic material. A line corresponding to the dynamic threshold \({{\varepsilon }_{{\text{crit}}}}_{1}\), reflecting a 99.98% reduction from the maximum initial variance, is superimposed to visually represent the first stopping criterion

As in the previous case, the randomness inherent in the learning process requires the focus to be placed on the moving averages rather than on the exact MCC scores. Observing the overall moving averages provides a more accurate depiction of the model's learning trajectory and its capability to generalize and make accurate predictions over time.

Like in the previous case, the rate of change of variance was evaluated, which involved an examination of variances over sets of iterations, with each set comprising 10 iterations. The minimum and maximum variance within each set were discerned and utilized to compute the slope indicative of the rate of change. This analysis is visually presented in Fig. 11, where the iteration number is plotted on the x-axis, while the calculated rate of change in variance is depicted on the y-axis, allowing us to infer that a larger slope corresponds to a more significant improvement in the model over the iterations in question. A line corresponding to the critical threshold \({{\varepsilon }_{{\text{crit}}}}_{2}\) of 0.001 is also added to the plot, signifying the rate of change threshold according to our second stopping criterion. In this case, like the isotropic scenario, after about 120 iterations, the rate of change of variance falls below the critical value, suggesting that further training does not result in significant improvements. Furthermore, the variance values also fall within the acceptable range set by our first criterion, confirming that a training size of approximately 220 data points can be considered sufficient.

Fig. 11

Assessment of the variance rate change, used to determine a stopping criterion for iterative processes

3.2 Influence of size of initial training set

In active learning, the size of the initial training set is a critical factor influencing the learning process. This initial set is the primary source from which the model acquires knowledge of the given task. Many active learning strategies necessitate a substantial quantity of initially labeled data to achieve a certain level of quality. This enables the model to 'warm up' and subsequently function optimally. Prior to this warm-up phase, random selection often surpasses most active learning strategies in performance. This situation is typically identified as a high-budget regime. The term 'cold start' as investigated by Zhu et al. [43] refers to the limited capacity of a model to capture uncertainty, a problem that is particularly pronounced in low-budget regimes. In these scenarios, budgetary constraints lead to a smaller initially labeled dataset, exacerbating the model's difficulty in handling uncertainty [44, 45].

Based on these considerations, as part of this study, we aim to explore how the size of the initial training set affects model performance, given a fixed budget of 200 training data instances. For instance, we might begin with 20 initially labeled instances followed by 180 active learning iterations. Alternatively, we could start with a larger seed set of 40 instances, leading to 160 subsequent active learning iterations. Variations on this theme will be tested to explore the range of possible outcomes. The principal motivation behind this method is to delineate the trade-offs between the size of the initial training set and the number of active learning iterations under budget constraints. This approach promises to shed light on how best to allocate resources for optimal learning outcomes. For example, we aim to ascertain whether an increased initial seed set improves model performance sufficiently to counterbalance a reduced number of active learning iterations. Conversely, a more advantageous pathway might emerge in having a smaller initial seed set coupled with an increased number of opportunities for the model to actively learn from new instances. The insights gleaned from this investigation will significantly enhance our understanding of the effects of initial model configuration on long-term learning efficiency within given budgetary parameters.

Building on this, our experiment was designed to test initial training sizes of 20, 40, 80, 100, and 120 instances within a total data budget of 200. For each of these scenarios, the training was done using an anisotropic material as described in Sect. 3.1.2. After training, the average MCC scores using five distinct test sets as described in Sect. 2.4 were recorded and shown in Fig. 12.
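The split of the fixed budget between the initial set and the active learning iterations follows from simple bookkeeping, sketched here with hypothetical variable names:

```python
# Fixed budget of labeled training instances
BUDGET = 200
initial_sizes = [20, 40, 80, 100, 120]

# Each active learning iteration adds one labeled point, so the number of
# iterations is whatever remains of the budget after the initial set.
schedule = {n_init: BUDGET - n_init for n_init in initial_sizes}

for n_init, n_iter in schedule.items():
    print(f"initial set: {n_init:3d}  active learning iterations: {n_iter:3d}")
```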

Fig. 12

Comparison of average MCC scores for different initial training sizes (20, 40, 80, 100, 120) within a fixed budget of 200 data points. Each score is derived from five distinct test sets, each containing 600 test cases, to illustrate the trade-off between initial training size and model performance within a fixed budget of data points available for training

Based on Fig. 12, the MCC scores improve with increasing initial training size up to a point. At the lowest initial training size of 20, the minimum MCC score was just 0.15 and the maximum 0.89. As the initial training size increased to 40, the minimum MCC rose significantly to 0.41, while the maximum dipped slightly to 0.83. A substantial leap in performance is observed when the initial training size reaches 80, yielding a minimum MCC of 0.75 and a maximum MCC of 0.95. However, the crucial consideration is the trade-off between the initial training size and the number of subsequent active learning iterations. The results show that the optimum is reached with an initial training size of 100 instances. Here, the MCC ranges from 0.79 to 0.98, so both the minimum and the maximum are higher than for the smaller initial training sizes. Increasing the initial training size further to 120 did not improve the performance: the MCC values range from 0.78 to 0.98, i.e., the minimum is slightly lower than for an initial size of 100. Given these findings, it can be concluded that, within the budget of 200 training instances, starting with 100 initially labeled instances provides the most effective balance between the warm-up phase and the number of active learning iterations. It should be noted that the performance of the SVC is heavily influenced by the initial training data, and the later improvement through active learning builds on that initial warm-up. At very low initial training sizes, the initial performance is poor, and in a high-dimensional feature space the active learning iterations that build on such a poor starting model improve the performance less than a larger initial training set does. However, this does not imply that active learning is ineffective. On the contrary, it highlights the importance of completing an adequate warm-up phase before expecting significant improvement through active learning.
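The minimum, average, and maximum MCC values quoted above are statistics over several test sets. As an illustration only, the sketch below computes the MCC from its standard confusion-matrix definition and summarizes a list of scores; it assumes binary labels encoded as +1 and -1 and returns 0 when the denominator vanishes, which is a common convention but an assumption here. The helper names `mcc` and `summarize` are hypothetical.

```python
import math


def mcc(y_true, y_pred, positive=1):
    """Matthews Correlation Coefficient for binary labels (standard definition)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0.0:
        return 0.0  # assumed convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denom


def summarize(scores):
    """Return (minimum, mean, maximum) of MCC scores over several test sets."""
    return min(scores), sum(scores) / len(scores), max(scores)
```

With one MCC value per test set, `summarize` yields the kind of range and average reported for each initial training size.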

3.3 Comparison of active learning with static learning using random and uniform training data

This section examines and evaluates three data generation strategies: Active Learning, Uniform Data, and Random Data. The motivation behind this comparison lies in understanding their effectiveness under various resource constraints, shedding light on their adaptability and performance across different budget sizes, in this case limited to the range between 140 and 200. As shown in Sect. 3.1.2, beyond this range the active learning strategy is not capable of further improving the quality, because the rate of change in variance is reduced and continuing the learning is not efficient. In the static learning models, the training set comprises the budget-defined number of data points (here stresses). The stresses are distributed either randomly or uniformly on the surface of a unit sphere, with a ratio of 6D to 3D unit stresses equal to 2, as suggested by Shoghi and Hartmaier [17]. For the active learning model, the size of the initial training set is fixed at 100. The remaining budget is used for the iterative generation of new data points, based on the learning derived from the initial set. For each case, training was performed using an anisotropic material as described in Sect. 3.1.2. The MCC scores were recorded post-training across five distinct test sets, as described in Sect. 2.4. The results are shown in Fig. 13.
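To make the static random strategy concrete, the following sketch draws unit stresses at random on a unit sphere with a 6D to 3D ratio of 2. It rests on several assumptions that are not confirmed by the text: that "3D unit stresses" means stresses with normal components only (zero shear), that random directions are obtained by normalizing Gaussian samples, and that the ratio is met by integer division of the budget. The uniform variant is not covered, and all function names are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Random direction on the unit sphere in `dim` dimensions (normalized Gaussian)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:  # guard against a (practically impossible) zero vector
            return [x / norm for x in v]


def random_static_set(budget, rng, ratio_6d_to_3d=2):
    """Split the budget into 6D and 3D unit stresses with roughly the given ratio."""
    n_3d = budget // (ratio_6d_to_3d + 1)
    n_6d = budget - n_3d
    full = [random_unit_vector(6, rng) for _ in range(n_6d)]
    # Assumed meaning of "3D": normal components only, shear components zero.
    normal_only = [random_unit_vector(3, rng) + [0.0, 0.0, 0.0] for _ in range(n_3d)]
    return full + normal_only


points = random_static_set(150, random.Random(0))
print(len(points), "unit stresses generated")
```

For a budget of 150 this gives 100 six-dimensional and 50 three-dimensional unit stresses; for budgets not divisible by 3 the ratio is only approximately 2.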

Fig. 13

Performance comparison of Active Learning, Static Learning Using Uniform Data, and Static Learning Using Random Data. The average Matthews Correlation Coefficient (MCC) is plotted against various budget levels. This highlights the effectiveness of each strategy with different budgets

Figure 13 compares the average MCC (see Sect. 2.4) at various budget levels for the three learning strategies: Static Learning Using Uniform Data, Active Learning, and Static Learning Using Random Data. The x-axis represents the budget, and the y-axis represents the average MCC score. Each strategy is shown as a separate line plot with a distinct color: blue for the Uniform Data, red for Active Learning, and green for the Random Data. Each point on a line corresponds to a specific budget value and its average MCC, so the strategies can be compared directly at a given budget level by comparing their y-values. This representation allows the effectiveness of each learning strategy to be assessed under different budgets and provides insight into their performance and efficiency. It can help in selecting the most suitable learning strategy for the available resources, which is particularly beneficial where data collection is costly or time-consuming. Based on such an analysis, the trade-offs between time, cost, and achieving a specific score can be understood. For instance, in the active learning approach, 100 data points of the budget (half of the maximum budget of 200) are used for the initial warm-up phase, with the remainder allocated to iterative learning. This upfront cost can be offset by the strategic selection of informative data points during the iterative phase, potentially resulting in quicker achievement of high scores and greater overall efficiency. In contrast, the static learning strategies, with both uniform and random data, spend the entire budget up front on a fixed training set. When the budget is not restricted, the static learning approach using uniformly distributed data can perform comparably well with active learning due to the even coverage of the data space. However, the same cannot be said for static learning with randomly selected data. Given the lack of strategic or even distribution in data selection, this approach may require more resources or time to reach performance levels similar to those of the other strategies.

3.4 Analysis of the sampled points

In this section, the objective is to thoroughly examine the specific regions from which most of the information is gathered by the active learner and to determine whether certain areas are favored for data sampling. The 6D sampling space is subdivided such that the first three dimensions are represented by normal components and the remaining three are represented by shear components. To investigate whether active learning methods display a preference for specific regions or maintain a consistent proportion within these two types of components, a ratio of shear components over normal components is introduced. If \({\varvec{\sigma}}=\left({\sigma }_{11},{\sigma }_{22},{\sigma }_{33},{\sigma }_{23},{\sigma }_{13},{\sigma }_{12}\right)\) is a data point in our search space, this ratio can be defined as:

$$\frac{{\text{Shear}}}{{\text{Normal}}}=\frac{\sqrt{6\left({\sigma }_{23}^{2}+{\sigma }_{13}^{2}+{\sigma }_{12}^{2}\right)}}{\sqrt{{\left({\sigma }_{11}-{\sigma }_{22}\right)}^{2}+{\left({\sigma }_{22}-{\sigma }_{33}\right)}^{2}+{\left({\sigma }_{33}-{\sigma }_{11}\right)}^{2}}}$$
(16)
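As an illustration of Eq. (16), the sketch below evaluates the ratio for a single stress given in the component order \(\left({\sigma }_{11},{\sigma }_{22},{\sigma }_{33},{\sigma }_{23},{\sigma }_{13},{\sigma }_{12}\right)\). It assumes that the denominator sums the squared differences of the three distinct pairs of normal components. The handling of a vanishing denominator (returning infinity for nonzero shear and raising an error otherwise) is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def shear_normal_ratio(sigma):
    """Shear/Normal ratio for sigma = (s11, s22, s33, s23, s13, s12)."""
    s11, s22, s33, s23, s13, s12 = sigma
    shear = math.sqrt(6.0 * (s23 ** 2 + s13 ** 2 + s12 ** 2))
    normal = math.sqrt((s11 - s22) ** 2 + (s22 - s33) ** 2 + (s33 - s11) ** 2)
    if normal == 0.0:
        if shear == 0.0:
            raise ValueError("ratio undefined: no deviatoric normal or shear part")
        return math.inf  # assumed convention: purely shear-dominated stress
    return shear / normal


# Uniaxial stress: no shear contribution, ratio is zero.
print(shear_normal_ratio((1.0, 0.0, 0.0, 0.0, 0.0, 0.0)))
```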

In Fig. 14, the distributions of the ratio of shear components to normal components for iterations 0, 25, 50, and 100 of the active learning process are illustrated. For each iteration, a histogram has been plotted that depicts the frequency of occurrence of different \({\text{Shear}}/{\text{Normal}}\) ratios within the sampled points. The transparency in the histograms allows for the comparison of the distributions at various iterations, with darker areas indicating overlaps in the distributions. Additionally, the trend line for each iteration aids in understanding the shape and spread of the distributions.
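The histogram analysis described here can be sketched without plotting. The example below bins a list of ratios into equal-width bins and returns the center of the most populated bin, which corresponds to the most commonly occurring ratio in an iteration. The bin count, the range, and the function names are illustrative assumptions and do not reflect the settings used for the figures.

```python
def histogram(values, n_bins, lo, hi):
    """Count values in n_bins equal-width bins on [lo, hi]; return (counts, width)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # Values equal to `hi` are placed in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts, width


def modal_bin_center(values, n_bins, lo, hi):
    """Center of the bin with the maximum frequency (first one on ties)."""
    counts, width = histogram(values, n_bins, lo, hi)
    k = max(range(n_bins), key=counts.__getitem__)
    return lo + (k + 0.5) * width


ratios = [0.1, 0.55, 0.6, 0.65, 1.9]  # made-up ratios for illustration only
print(modal_bin_center(ratios, n_bins=4, lo=0.0, hi=2.0))
```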

Fig. 14

Distribution of the calculated ratio of shear to normal stress components among the data points sampled by the active learning approach at various iterations for the isotropic reference material. The vertical lines indicate the bin with the maximum frequency for each histogram, showing the most commonly occurring ratio in each iteration

Based on Fig. 14, it can be observed that the learner consistently selects areas with a shear to normal ratio of around 0.6, irrespective of the iteration. The persistent preference for regions with this ratio could imply that they are particularly valuable during the learning process, especially in the early stages. Shoghi and Hartmaier [17] noted that utilizing only 6D stresses may not provide a comprehensive representation for the training, thereby potentially underrepresenting the subspace of normal stresses. The same calculation was done for the anisotropic case, and as can be seen in Fig. 15, the average ratio is also 0.6. This further emphasizes the necessity for the targeted sampling of areas with a lower ratio of shear to normal stress.

Fig. 15

Distribution of the calculated ratio of shear to normal stress components among the data points sampled by the active learning approach at various iterations for the anisotropic reference material. The vertical lines indicate the bin with the maximum frequency for each histogram, showing the most commonly occurring ratio in each iteration

4 Conclusion

In materials engineering, constitutive equations play a pivotal role in predicting how materials will respond when subjected to a load. Central to these models are yield functions, providing the fundamental framework for understanding plasticity. Machine Learning (ML) approaches offer a novel way to craft these yield functions directly from data, adeptly capturing the complex behaviors of materials. However, the efficacy of these ML models relies significantly on the quality of the training data. For ML to accurately reflect and predict complex material behaviors, it is important to provide high-quality, representative training data. Building on this foundation, this paper examines the critical process of generating optimal training data using the Query-By-Committee (QBC) algorithm, specifically adapted for Support Vector Classification (SVC). The essence of the QBC approach is to identify new training data from regions of the feature space where the predictions of multiple models diverge significantly, ensuring a more comprehensive and efficient training set.
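The selection principle of QBC summarized above can be illustrated with a deliberately simplified sketch. Here the committee consists of hypothetical one-dimensional threshold classifiers standing in for the trained SVC members, and disagreement is measured as the population variance of their +1 or -1 votes; both choices are assumptions for illustration rather than the configuration used in this work.

```python
from statistics import pvariance


def make_threshold_classifier(threshold):
    """Hypothetical committee member: +1 above the threshold, -1 otherwise."""
    return lambda x: 1.0 if x > threshold else -1.0


def committee_variance(committee, x):
    """Disagreement of the committee at x, measured as the variance of the votes."""
    votes = [member(x) for member in committee]
    return pvariance(votes)


def select_query(committee, candidates):
    """Pick the candidate on which the committee disagrees the most."""
    return max(candidates, key=lambda x: committee_variance(committee, x))


committee = [make_threshold_classifier(t) for t in (0.4, 0.5, 0.6)]
candidates = [0.1, 0.45, 0.9]
print(select_query(committee, candidates))  # the point between the thresholds
```

In an actual active learning loop, the selected point would be labeled, added to the training set, and the committee retrained before the next query.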

Starting with different initial training sets, the effectiveness of active learning combined with SVC was examined over a series of iterations for an isotropic and an anisotropic reference material, with key metrics such as the variance and the Matthews Correlation Coefficient (MCC) being evaluated on a test data set. The observed trends, a consistent decrease in variance and a general increase in MCC, affirm the effectiveness of the active learning approach. It was concluded that reaching a user-defined value of variance, determined by a desired percentage reduction from the maximum variance, together with a reduction in the rate of change of the variance, can serve as a dynamic stopping criterion that reliably indicates the convergence of the training process to an accurate model. When the change in variance is not significant in comparison with the previous value, the learning progression of the model has plateaued or at least slowed down significantly. At such a juncture, further training of the model becomes inefficient because the performance improvements are marginal, which was observed for the isotropic material after 95 iterations and for the anisotropic material after 120 iterations.
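One possible reading of this dynamic stopping criterion is sketched below. The training stops once the latest variance has dropped below a target derived from a desired fractional reduction of the maximum variance and, in addition, the relative change with respect to the previous value is small. The default values of `reduction` and `rate_tol`, as well as the function name, are hypothetical and would have to be chosen by the user.

```python
def should_stop(variance_history, reduction=0.8, rate_tol=0.05):
    """Return True if both parts of the (assumed) stopping criterion are met.

    reduction: desired fractional reduction from the maximum variance.
    rate_tol:  largest relative change between the last two values that
               still counts as a plateau.
    """
    if len(variance_history) < 2:
        return False
    v_max = max(variance_history)
    current, previous = variance_history[-1], variance_history[-2]
    below_target = current <= (1.0 - reduction) * v_max
    if previous == 0.0:
        rate = 0.0  # variance already vanished; no further change expected
    else:
        rate = abs(current - previous) / previous
    return below_target and rate <= rate_tol


history = [1.0, 0.5, 0.19, 0.185]  # made-up variance values for illustration
print(should_stop(history))
```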

Our study highlights the importance of the size of the initial training set and its crucial role in the subsequent learning process. As in any active learning strategy, a sufficiently large initial training set is necessary to reach a certain level of performance quality, which we refer to as the warm-up phase. This crucial phase allows the model to establish a solid foundation before it can function at its best. Our investigation was done in a fixed-budget scenario with a predefined maximum number of training data points, in order to find a balance between an initial training set of sufficient quality and the possible gain in the learning process. Our findings underscore that while increasing the initial training size can indeed improve model performance, there is an optimal point of balance between the initial training size and the number of active learning iterations. In this work, starting with 100 initially labeled instances, within the established budget of a total of 200 training data points, yielded the best performance of the final model. This demonstrated that a substantial initial seed set effectively primes the model, allowing for optimal performance in the subsequent active learning phases. On the other hand, increasing the initial training size beyond this point did not result in significant further improvement. These findings contribute valuable insights for efficient resource allocation and decision-making concerning the initial model configuration for long-term efficiency in active learning under budgetary constraints.

Our comparative evaluation of three different learning strategies (Active Learning, Static Learning Using Uniform Data, and Static Learning Using Random Data) sheds light on their effectiveness and adaptability under varying resource constraints. Active learning was more efficient in achieving higher scores faster, especially when resources are limited. When resources are plentiful, static learning using uniform data can compete with active learning. However, static learning using random data will likely still underperform without additional resources, reaffirming the importance of strategic data selection even within static learning strategies. This result implies that active learning and static learning with uniform data each have their own strengths, and the choice between them may depend on the specific budgetary constraints and the performance goals at hand.

The thorough examination of the specific regions within the six-dimensional sampling space from which the active learner predominantly gathers information revealed a consistent preference for areas with a shear to normal stress ratio of around 0.6, regardless of the iteration stage. This finding, which emanates from an initial random distribution, implies that these regions might hold a particular value during the learning process. This becomes more apparent during the early learning stages, when the model is still in the process of establishing its foundational knowledge. This persistent pattern may also emphasize the necessity for a more targeted sampling strategy, particularly focusing on areas with the given ratio of shear to normal stress components.