Introduction

Finding functional and useful crystal or molecular structures is highly challenging, and numerous methods have been proposed.1,2,3,4,5,6 Even when the crystal structures are limited to those for which prototypes are known, the enormous number of possible elemental substitutions makes it considerably difficult to characterize the physical properties associated with these structures. Within the vast pool of possible crystal or molecular structures, researchers frequently confine their exploration to an iterative search loop: formulating hypotheses about structures of interest and then verifying these hypotheses by validating physical properties through experiments or theoretical calculations. This approach is essentially a trial-and-error problem-solving method with solution-oriented and problem-specific features.

Solution-oriented approaches seeking to find optimal solutions invest relatively little effort to reveal why and how the solution is found (e.g., optimal structures in materials discovery). Besides, problem-specific approaches make little effort to generalize the solution to different problem scopes (e.g., extending the structure search space). Therefore, these approaches inevitably involve limitations when deployed in a screening space with wider bounds than the known material structures.

In this study, we propose a query-and-learn architecture based on active learning to assist researchers in actively monitoring the material structure discovery process. The query-and-learn method aims to (1) accurately estimate physical properties from the most limited first-principles data, (2) accelerate the search for outstanding structures, (3) interpret the structure search process, and (4) generalize findings by extracting the structure–property correlations. The problem regarding the formation mechanism of SmFe\(_{12}\)-based compounds with the ThMn\(_{12}\) structure is used to demonstrate the effectiveness of the proposed method.

The original structure of iron-rich SmFe\(_{12}\) compounds was first discovered in the late 1980s.9,10,11 These compounds were expected to show high saturation magnetization, magnetocrystalline anisotropy, and Curie temperature.12 However, SmFe\(_{12}\) and other families of RFe\(_{12}\), with R denoting a rare earth element, have not been widely adopted to produce the excellent magnets that could be obtained, owing to the practical difficulty of stabilizing the material. Numerous studies have substituted elements such as Co, Ti, V, Cr, Mo, W, or Ga to obtain a stable ThMn\({_{12}}\)-type phase.13,14,15,16,17,18,19 Going beyond ternary compounds, researchers have recently emphasized searching for the most potentially stable SmFe\(_{12}\)-based quaternary compounds using the bi-element substitution method.20,21,22,23,24,25,26,27 Because the stabilizing elements are assumed to be substituted at the Fe sites, a large supercell of SmFe\(_{12}\) should be considered as the host structure to investigate substitution structures with diverse elemental substitutions. Therefore, a more efficient methodology to investigate the structure space, where the number of candidates increases combinatorially, is urgently required.

Figure 1 summarizes the key components of the query-and-learn active learning design for discovering formable SmFe\(_{12}\)-based compounds in the ThMn\({_{12}}\) structure. At the beginning of the query step, a pool of not-yet-calculated structures is created by applying substitution operators to the prototype of SmFe\({_{12}}\). The system queries the most informative candidates, estimates their properties, and adds them to the training data of the machine learning predictors. Canonically, the most informative structures are assumed to be those with the most significant impact on the accuracy of the prediction model. However, predictive ability is usually difficult to define explicitly because predictive evaluations often lack information on the relative positions of newly queried, training, and testing data. For example, the authors of References 28 and 29 reported exploration strategies that assume out-of-distribution structures to be superior structures. Therefore, querying and then accurately predicting structures in the out-of-distribution region is in higher demand than predicting properties for all not-yet-calculated structures in the pool. Furthermore, how the estimator infers the predicted value and how the learned function changes when queried data are added are often hidden from researchers monitoring the discovery process. In the learn step of the query-and-learn design, we extend the prediction model’s interpretability by introducing metric learning to transform the original structure representation vector into a low-dimensional space that preserves the smoothness of the formation energy function. Consequently, information on the progress of the structure search can be actively monitored, including the prediction accuracy, features of the learned model, regions of outstanding structures, and inter-correlations between queried structures and training structures.
Studies of active learning designs used in materials science are shown in References 28 and 30,31,32, besides other machine learning-assisted material designs shown in References 33,34,35.

Figure 1

Illustration of the proposed query-and-learn active learning design to discover new SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures. Left: start with a pool of possible initial structures prepared by substituting different elements in different Fe sites and a small data set of calculated structures. The data set contains optimized structures, calculated formation energies, and structural deformations from the initial structures. Middle: from the data set of calculated structures, a two-dimensional embedding space is learned by applying metric learning for kernel regression on the orbital-field matrix representation7, 8 and the calculated formation energy \(\Delta E\) of the optimized structures. A regression function estimating the formation energy \(\Delta E\) from the coordinates of substituted structures in the embedding space is learned from the data set of calculated structures to estimate the \(\Delta E\) for the not-yet-optimized structures. The expected formation energy prediction errors are also used to recommend candidates for the subsequent first-principles calculation from among the structures that have not yet been calculated. The structure–stability relationship is mined as the correlation between local embedding representation and \(\Delta E\). Right: structures with high potential to improve the regression function are queried to first-principles calculations to optimize the structure, estimate the formation energy \(\Delta E\), and evaluate the deformation after structure optimization. The calculated data are then updated to the data set of calculated structures.

The contributions of this work are summarized as follows:

  • We investigate systematically the formation energy and magnetization of 3307 SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) with \(\mathsf {X}, \mathsf {Y}\) as Mo, Zn, Co, Cu, Ti, Al, and Ga, limited by \(\upalpha +\upbeta <4\) using the VASP calculation procedure from OQMD.36

  • We confirm that SmFe\(_{9}\)[Al/Ga]\(_{2}\)Ti structures have the highest stability and SmFe\(_{9}\)Co\(_{3}\) structures have optimal magnetization value.

  • We confirm that the SmFe\(_{12-\upalpha -\upbeta }\)[Al/Ga]\(_{\upalpha }\mathsf {Y}_{\upbeta }\) structures show on average negative formation energies and an increase in the coordination number at substituted sites (Al/Ga), whereas other families showed opposite trends.

  • We propose an active learning design with embedding representation of orbital-field matrix that achieves an optimal prediction accuracy and recalls outstanding structures using limited training data.

  • We extract a relationship between bi-element substitution and stability, that is, SmFe\(_{12-\upalpha -\upbeta }\)[Al/Ga/Ti]\(_{\upalpha }\mathsf {Y}_{\upbeta }\) is potentially stable, and SmFe\(_{12-\upalpha -\upbeta }\)[Mo]\(_{\upalpha }\mathsf {Y}_{\upbeta }\) is potentially unstable, which can be interpreted using the embedding representation.

In the following sections, we will explain the proposed approach in detail, and use it for finding potentially stable SmFe\(_{12}\)-based compounds. The exploration space for discovering potentially stable SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures is set with \(\mathsf {X}\) and \(\mathsf {Y}\) as Mo, Zn, Co, Cu, Ti, Al, and Ga, limited by \(\upalpha +\upbeta <4\), where \(\upalpha\) and \(\upbeta\) are integers. We will demonstrate the efficiency of the proposed approach, and show how to extract information associated with structural stability. Details of the data preparation are shown in the “First-principles calculation” section. The “Active learning design” section presents the components of the active learning architecture in detail. Last, the “Experiment and discussion” section shows the performance of active learning designs and the results of interpreting correlations extracted from the embedding space regarding the formation energy.

First-principles calculation

Creation of SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures

We focus on SmFe\(_{12}\)-based crystalline magnetic materials with the formula SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\), with \(\mathsf {X}\) and \(\mathsf {Y}\) as the substituted elements from Mo, Zn, Co, Cu, Ti, Al, and Ga; \(\upalpha\) and \(\upbeta\) are the integer numbers of \(\mathsf {X}\) and \(\mathsf {Y}\) atoms, respectively. A hypothetical, not-yet-calculated structure is created by substituting \(\upalpha\) iron sites with the element \(\mathsf {X}\) and \(\upbeta\) iron sites with the element \(\mathsf {Y}\). There are numerous possible hypothetical structures; hence, we limit our investigation to \(\upalpha + \upbeta < 4\). Owing to the symmetry of the iron sites in the host SmFe\(_{12}\) structure, new substituted structures were compared with one another to remove duplicates. We followed the comparison procedure provided by qmpy, a Python application programming interface of OQMD.36 The internal coordinates of the structures were compared by examining all rotations allowed by each lattice and searching for rotations and translations that map the atoms of the same species onto one another within a given level of tolerance. Here, any two structures with a percent deviation in lattice parameters and angles smaller than 0.1 were considered identical. Furthermore, we applied our orbital-field matrix (OFM)7,8 descriptor to eliminate duplicates. Specifically, two structures were considered the same when the \(L_{2}\) norm of the OFM difference was less than \(10^{-3}\).
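The descriptor-based duplicate check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the qmpy procedure or the code used in this study: the function name `remove_duplicates` and the toy vectors are hypothetical, and each structure is assumed to be already flattened into a descriptor vector.

```python
import math

def remove_duplicates(vectors, tol=1e-3):
    """Return indices of structures kept after descriptor-based deduplication.

    A structure is dropped when the L2 norm of the difference between its
    descriptor and that of an already kept structure is below `tol`.
    """
    kept = []
    for idx, vec in enumerate(vectors):
        if all(math.dist(vec, vectors[k]) >= tol for k in kept):
            kept.append(idx)
    return kept

# Toy descriptors (hypothetical): the second differs from the first by 5e-4.
toy_vectors = [[1.0, 0.0], [1.0, 0.0005], [0.0, 1.0]]
unique_indices = remove_duplicates(toy_vectors)
```

The pairwise comparison is quadratic in the number of structures, which is usually acceptable for a few thousand candidates.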

To initialize the data set of the active learning model, we substituted atoms of one element from Mo, Zn, Co, Cu, Ti, Al, and Ga into the iron sites of the SmFe\(_{12}\) host structure. Consequently, there were 283 structures under the formula SmFe\(_{12-\upalpha }\mathsf {X}_{\upalpha }\) with \(\upalpha \in \{1, 2, 3\}\). By substituting two elements, we created 3024 structures using the formula SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) with \(\upalpha + \upbeta < 4\). We used this data set as the initial not-yet-calculated data set \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{1}\); a detailed description is provided in the “Data set notation” section. In other words, this data set serves as the screening (exploration) space for the exploration process; we retain all these unknown structures as distinct from the initial space. Subsequently, all structures were subjected to structural optimization through first-principles calculations to obtain the optimal structures.

Assessment of formation energy of structures

The first-principles calculations using density functional theory (DFT)37,38 are among the most practical calculation methods used in materials science. DFT calculations precisely estimate the total energy of the materials, which can be used to determine the formation energy of the substituted structure. The formation energy of a given structure s is defined as follows:

$$\Delta E[s] = \frac{1}{N} (E[s] - \Upsigma _{i}^{N}E[s_{i}] ),$$
(1)

where \(\Delta E[s]\), E[s], and \(E[s_{i}]\) are the formation energy, the total energy of structure s per formula unit, and the total energy of simple substance \(s_{i}\) per atom, respectively. Finally, N is the total number of atoms in the formula unit of s. The simple substances were chosen as (1) \(Im{\text {-}}3m\) with Fe and Mo, (2) \(R{\text {-}}3m\) with Sm and Al, (3) \(Fm{\text {-}}3m\) with Cu and Co, (4) P6/mmm with Ti, and (5) P63/mmc with Zn. Details of the substances chosen are provided in the Supplementary Information. A structure whose formation energy is less than or equal to zero, that is, \(\Delta E \le 0\), is a potentially formable material in nature, whereas a structure associated with \(\Delta E > 0\) could be considered unstable. When competing phases are taken into account, the stability of the structure should be discussed using the hull distance. In this study, we use the formation energy defined in Equation 1 as an index for simplicity. The relationship between the experimental material and the hull distance at \(T=0\,{\text {K}}\) has been summarized in References 39 and 40. The stability of the magnets at finite temperature can be found in Reference 41. We discuss in detail the reliability of this calculation in the Supplementary Information.
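As a worked illustration of Equation 1, the sketch below evaluates the formation energy from a total energy per formula unit and per-atom reference energies. The function name and all numerical values are hypothetical placeholders chosen for easy arithmetic; they are not results of this study.

```python
def formation_energy(total_energy, composition, reference_energy):
    """Equation 1: (E[s] - sum of per-atom reference energies) / N.

    total_energy     -- total energy of the structure per formula unit (eV)
    composition      -- mapping element -> number of atoms in the formula unit
    reference_energy -- mapping element -> energy per atom of the simple substance (eV)
    """
    n_atoms = sum(composition.values())
    reference = sum(count * reference_energy[element]
                    for element, count in composition.items())
    return (total_energy - reference) / n_atoms

# Hypothetical numbers for a SmFe11Ti-like formula unit (13 atoms).
composition = {"Sm": 1, "Fe": 11, "Ti": 1}
reference_energy = {"Sm": -4.0, "Fe": -8.0, "Ti": -7.0}
delta_e = formation_energy(-100.3, composition, reference_energy)
```

With these placeholder values the reference sum is -99.0 eV, so the result is -1.3 eV divided by 13 atoms, that is, -0.1 eV/atom.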

In this study, we follow the computational settings of OQMD36,42 to estimate the total energy of all structures. The calculations were performed using the Vienna ab initio simulation package (VASP)43,44 by utilizing the projector-augmented wave method potentials45,46 and the Perdew–Burke–Ernzerhof47 exchange–correlation functional. Pseudopotentials used in this work were collected from POTCAR library version 5.4 of VASP.45,48,49,50,51 For the 4f element Sm, Sm\(^{3+}\) potentials were applied, in which five electrons in the f shell were treated as core electrons. Details of the potentials for the other elements are shown in the Supplemental Information, with notation as shown in Reference 49.

All calculations were spin-polarized with the ferromagnetic alignment of the spins. For a given structure, we performed three optimization steps following the coarse relax, fine relax, and standard procedures provided by OQMD. The k-points per reciprocal lattice for these calculation series were selected as 4000, 6000, and 8000 for coarse relax, fine relax, and standard, respectively. Optimal lattice parameters from each step were used as the initial setting for the next step. We set 520 eV as the cutoff energy in the standard calculation step. The total energies of the final converged calculations were used to estimate the formation energy, \(\Delta E\).

In addition, the total magnetic moment of these materials \(\upmu [s]\) was recalculated because we used an open-core approximation to treat the 4f electrons of Sm, as follows:

$$\upmu [s] = \Upsigma _{i}m[s_{i}] + \Upsigma {_k} J_{4{\text {f}}}g_{J_{4{\text {f}}}}[s_{k}],$$
(2)

where \(m[s_{i}]\) is the magnetic moment of atom i, \(J_{4{\text {f}}} g_{J_{4{\text {f}}}}[s_{k}]\) is the correction term with \(g_{J_{4{\text {f}}}}\) as the Landé factor, and \(J_{4{\text {f}}}\) is the angular momentum of lanthanide \(s_{k}\). Index i represents all atoms, and index k represents all lanthanide atoms in the structure. The contribution of the 4f electrons of Sm to the magnetization is \(J_{4{\text {f}}} g_{J_{4{\text {f}}}} = 0.714\). In this paper, this value is finally converted to magnetization per formula unit, M (T/f.u.).
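A minimal sketch of the correction in Equation 2 is given below. The function name, the species list, and the site moments are hypothetical; only the value 0.714 for Sm is taken from the text (it is consistent with \(J = 5/2\) and \(g_{J} = 2/7\)).

```python
# 4f correction J * g_J per lanthanide atom; the Sm value is the one quoted in the text.
CORRECTION_4F = {"Sm": 0.714}

def total_moment(species, site_moments, correction=CORRECTION_4F):
    """Equation 2: sum of calculated site moments plus the open-core 4f correction."""
    calculated = sum(site_moments)
    added = sum(correction.get(element, 0.0) for element in species)
    return calculated + added

# Hypothetical site moments (in Bohr magnetons) for a small toy cell.
toy_species = ["Sm", "Fe", "Fe"]
toy_moments = [-0.3, 2.2, 2.4]
mu_total = total_moment(toy_species, toy_moments)
```

Here the calculated moments sum to 4.3 and one Sm atom contributes 0.714, giving 5.014 in the same units.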

Active learning design

There are three essential components in the proposed active learning approach: (1) a pool \({\mathcal {D}}\) of not-yet-calculated structures (non-optimized) and first-principles calculated (optimized) structures, (2) an estimator \(\mathtt {E}\) to predict the target formation energy, and (3) an acquisition function \(\Upgamma\) to estimate the structures that should be queried in order of priority to enhance the prediction ability of \(\mathtt {E}\).

Data set notation

For a given query time t, we denote \({\mathcal {D}}^{\mathsf {calculated}}_{[1:t]-1}\) as the data set comprising all the structures queried and optimized by first-principles calculations at the start of the query time t. We also denote \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\) as the data set with the remainder of not-yet-calculated structures at the start of the query time t. From \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\), we select a data set \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\) such that by adding the calculated results of \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\) to \({\mathcal {D}}^{\mathsf {calculated}}_{[1:t]-1}\) we can improve the prediction ability of the estimator \(\mathtt {E}\). \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\) is queried by the acquisition functions described in the “Acquisition function” section (\({\mathcal {D}}^{\mathsf {beneficial}}_{t} \subset {\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\)). To evaluate the ability of the active learning system to search potentially stable structures, we also collect \({\mathcal {D}}^{\mathsf {outstanding}}_{t}\) from \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\) as a set of structures that are expected to be stable. Within the scope of finding the most potentially stable substituted SmFe\(_{12}\) families, if the calculated or predicted formation energy \(\Delta E\) is smaller than \(-0.1\) eV/atom, the structure is considered potentially stable. At query time t, a predetermined number of structures with the lowest \(\Delta E_{\text {pred}}\) predicted by \(\mathtt {E}\) are then added to \({\mathcal {D}}^{\mathsf {outstanding}}_{t}\) for verification using first-principles calculations.
First-principles calculations are then carried out for all the structures in \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\), and the obtained optimized structures are added to \({\mathcal {D}}^{\mathsf {calculated}}_{[1:t]-1}\) to get \({\mathcal {D}}^{\mathsf {calculated}}_{[1:t]}\). We then use \({\mathcal {D}}^{\mathsf {calculated}}_{[1:t]}\) as the training data for learning the estimator \(\mathtt {E}\). All the optimized structures confirmed using first-principles calculations to have a formation energy lower than the specified limit are considered potentially stable structures, and they are added to the data set \({\mathcal {D}}^{\mathsf {outstanding}}_{\mathsf {confirmed}}\), which comprises all the potentially stable structures that are confirmed up to this point. The set of all the structures estimated using the estimator \(\mathtt {E}\) as potentially stable structures is denoted by \({\mathcal {D}}^{\mathsf {outstanding}}_{\mathsf {estimated}}\). The pseudo-code summarization of the entire query-and-learn process is shown in the Supplemental Information.
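The bookkeeping of one query time can be sketched as follows. This is an illustrative simplification, not the pseudo-code from the Supplemental Information: the function `query_step`, the structure labels, the scores, and the energies are hypothetical, and the acquisition function and the first-principles calculation are passed in as plain callables.

```python
def query_step(not_calculated, calculated, acquisition, first_principles, batch_size):
    """Move the top-scoring structures from the pool to the calculated set.

    not_calculated   -- list of structure labels (the not-yet-calculated pool at time t)
    calculated       -- dict label -> formation energy (calculated set before time t)
    acquisition      -- callable giving a score per label; higher means queried first
    first_principles -- callable returning the formation energy of a label
    """
    ranked = sorted(not_calculated, key=acquisition, reverse=True)
    beneficial = ranked[:batch_size]            # batch queried at time t
    updated = dict(calculated)                  # calculated set after time t
    for structure in beneficial:
        updated[structure] = first_principles(structure)
    chosen = set(beneficial)
    remaining = [s for s in not_calculated if s not in chosen]
    return remaining, updated, beneficial

# Toy run with made-up scores and energies (not values from this study).
scores = {"A": 0.1, "B": 0.9, "C": 0.5, "D": 0.7}
energies = {"A": 0.05, "B": -0.12, "C": 0.02, "D": -0.03}
remaining, calculated, beneficial = query_step(
    list(scores), {}, scores.get, energies.get, batch_size=2)
```

In a full loop, the estimator would be retrained on the updated calculated set before the next call, so that the acquisition scores reflect the newly added data.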

In \({\mathcal {D}}^{\mathsf {calculated}}_{[1:t]}\), we represent the calculated structures accumulated up to t with representation vectors \({\mathbf {x}}_{[1:t]}\) and formation energies \({\mathbf {y}}_{[1:t]}\). The formation energy of SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures is described in the “First-principles calculation” section. For simplicity, we denote \({\mathbf {x}}\) as the representation vector of a not-yet-calculated structure; a normal subscript denotes the data point index, the bracket subscript \([1{\text {:}}t]\) denotes the data collected up to t, and a superscript denotes the feature index. In this study, we applied the OFM7,8 as a descriptor to represent all structures. In the OFM representation, the outermost-shell electron configuration is used to represent each composition site. The OFM atomic representation follows Element.electronic_structure in pymatgen52 and is summarized in Table I in the Supplemental Information. All elements in the OFM appear in the form of \((u^{i}, u^{j})\), which counts the average coordination number of neighbors \(u^{j}\) surrounding the center \(u^{i}\). By representing each atom using its outer-shell electron configuration, each individual matrix element is associated with one specific coordination number of a pair of elements in a given structure. Practical interpretation samples are shown in References 53 and 54. In this work, after removing features that are zero in all structures, we obtained an 88-dimensional orbital-field vector to represent all SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures.

Gaussian process estimator

The Gaussian process estimator assumes that the joint distribution of the observed values \({\mathbf {y}}_{[1:t]}\) and predicted values \(\hat{\mathbf{y }}\) follows the Gaussian prior distribution, expressed as follows:

$$\begin{bmatrix} \mathbf{y} _{[1:t]} \\ \hat{\mathbf{y }} \end{bmatrix} \sim {\mathcal {N}} \left( 0, \begin{bmatrix} \upkappa ({{\mathbf {x}}}_{[1:t]} , {{\mathbf {x}}}_{[1:t]} ) & \upkappa ({{\mathbf {x}}}_{[1:t]} , {{\mathbf {x}}}) \\ \upkappa ({{\mathbf {x}}}, {{\mathbf {x}}}_{[1:t]} ) & \upkappa ({{\mathbf {x}}}, {{\mathbf {x}}}) \end{bmatrix} \right) .$$
(3)

With these assumptions, the predicted values for the unknown state points follow the conditional distribution calculated by updating the prior probability distribution after observing the sampled state points. Thus, \(\hat{\mathbf{y }} \sim {\mathcal {N}} \left( {{\varvec{\upmu }}}({\mathbf {x}}), {\varvec{\upsigma }}({\mathbf {x}})\right)\), where the mean \({\varvec{\upmu }}\) and variance \({\varvec{\upsigma }}\) are estimated as

$$\begin{aligned} {{\varvec{\upmu }}}({\mathbf {x}})= \,& {} \upkappa ({{\mathbf {x}}}, {{\mathbf {x}}}_{[1:t]}) {\upkappa ({{\mathbf {x}}}_{[1:t]}, {{\mathbf {x}}}_{[1:t]})}^{-1} \mathbf{y} _{[1:t]} , \end{aligned}$$
(4)
$$\begin{aligned} {{\varvec{\upsigma }}}({\mathbf {x}})= \,& {} \upkappa ({\mathbf {x}}, {\mathbf {x}}) \\&\quad - \upkappa ({\mathbf {x}}, {{\mathbf {x}}}_{[1:t]}) {\upkappa ({{\mathbf {x}}}_{[1:t]}, {{\mathbf {x}}}_{[1:t]})}^{-1} \upkappa ({{\mathbf {x}}}_{[1:t]},{\mathbf {x}}). \end{aligned}$$
(5)

The mean \({\varvec{\upmu }}\) and variance \({\varvec{\upsigma }}\) are the main components used to construct the acquisition functions, which are introduced in the “Acquisition function” section. The most conventional kernel, known as the Gaussian kernel \(\upkappa _{ij}\), is defined between \({\mathbf {x}}_{i}\) and \({\mathbf {x}}_{j}\) as follows:

$$\upkappa _{ij} := \upkappa ({\mathbf {x}}_{i}, {\mathbf {x}}_{j}) = \frac{1}{\upepsilon \sqrt{2\pi }}{\text {e}}^{- \left[ \frac{d({\mathbf {x}}_{i}, {\mathbf {x}}_{j})}{\upepsilon }\right] ^{2} },$$
(6)

where \(\upepsilon\) is a hyperparameter that is tunable to learn the best form of the kernel and d is conventionally defined as the Euclidean distance.
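The posterior mean and variance of Equations 4 and 5 with the Gaussian kernel of Equation 6 can be sketched with NumPy as below. This is a minimal illustration, not the implementation used in this study: the function names are hypothetical, the kernel width is fixed rather than tuned, d is the Euclidean distance, and a small jitter term is added to the training kernel matrix for numerical stability (an assumption not stated in the text).

```python
import numpy as np

def gaussian_kernel(a, b, eps=1.0):
    """Equation 6 with Euclidean d: exp(-(d / eps)^2) / (eps * sqrt(2 * pi))."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-(d / eps) ** 2) / (eps * np.sqrt(2.0 * np.pi))

def gp_posterior(x_train, y_train, x_new, eps=1.0, jitter=1e-8):
    """Posterior mean (Equation 4) and pointwise variance (Equation 5)."""
    k_tt = gaussian_kernel(x_train, x_train, eps) + jitter * np.eye(len(x_train))
    k_nt = gaussian_kernel(x_new, x_train, eps)
    k_nn = gaussian_kernel(x_new, x_new, eps)
    mean = k_nt @ np.linalg.solve(k_tt, y_train)
    cov = k_nn - k_nt @ np.linalg.solve(k_tt, k_nt.T)
    return mean, np.diag(cov)

# Toy one-dimensional data (hypothetical values).
x_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.1, -0.2, 0.05])
mean_at_train, var_at_train = gp_posterior(x_train, y_train, x_train)
mean_far, var_far = gp_posterior(x_train, y_train, np.array([[100.0]]))
```

At the training points the posterior reproduces the observations with near-zero variance, whereas far from all training data the mean returns to the zero prior and the variance returns to \(\upkappa ({\mathbf {x}}, {\mathbf {x}})\). This is the behavior that variance-based acquisition relies on.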

Metric learning

Human intuition regarding the Euclidean distance among data points from three-dimensional spaces often does not apply to higher-dimensional cases. In high-dimensional spaces (e.g., the 88-dimensional orbital-field vector in this work), if an enormous number of examples are distributed uniformly in a high-dimensional hypercube, most examples are closer to the face of the hypercube than to their nearest neighbor. If we approximate a hypersphere by a hypercube, in high dimensions, almost all the volume of the hypercube is outside the hypersphere.55 Moreover, with increasing dimensionality, the distance to the nearest neighbor approaches the distance to the farthest neighbor,56 which implies that the learned weight of the Gaussian process could be meaningless in distinguishing between neighbors and distant data points. In the following, we observe that estimators working on high-dimensional spaces show more difficulty in converging to obtain suitable prediction accuracy; in other words, it is more difficult to estimate both distant and neighbor data points.
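The hypersphere and hypercube statement above can be checked with a short calculation. The sketch below uses the standard closed-form volume of a d-dimensional ball; the function name is hypothetical, and the ball is assumed to be inscribed in a unit hypercube (diameter equal to the edge length).

```python
import math

def inscribed_ball_fraction(dim):
    """Fraction of a unit hypercube's volume occupied by its inscribed ball.

    The ball has radius 1/2, so its volume is
    pi^(dim/2) / Gamma(dim/2 + 1) * (1/2)^dim, and the cube has volume 1.
    """
    return math.pi ** (dim / 2) / (math.gamma(dim / 2 + 1) * 2 ** dim)

fraction_2d = inscribed_ball_fraction(2)     # pi / 4
fraction_3d = inscribed_ball_fraction(3)     # pi / 6
fraction_88d = inscribed_ball_fraction(88)   # dimension of the OFM vector in this work
```

In two and three dimensions the ball still fills most or about half of the cube, but at 88 dimensions the fraction is vanishingly small, so nearly all of the volume lies in the corners outside the ball.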

To overcome the curse of dimensionality and to track how the learned function is created, we propose the use of a metric learning algorithm for kernel regression (MLKR),57 which optimizes the smoothness of dependence between a representation vector and a target property. First, the Mahalanobis distance \(d({\mathbf {x}}_{i}, {\mathbf {x}}_{j})\) is defined as a linear transformation of conventional Euclidean distance as follows:

$$d({\mathbf {x}}_{i}, {\mathbf {x}}_{j}) = || \mathsf {A}({\mathbf {x}}_{i} - {\mathbf {x}}_{j}) ||^{2} ,$$
(7)

where \(\mathsf {A}\) is a linear transformation matrix. The MLKR method attempts to optimize the loss function \({\mathcal {L}}\) defined by the training error as

$${\mathcal {L}} = \Upsigma _{i} (y_{i} - {\hat{y}}_{i})^{2} .$$
(8)

With the defined kernel in Equation 6, we can iteratively find the optimal \(\mathsf {A}\) by \(\Delta \mathsf {A}\), defined as

$$\Delta \mathsf {A} = \uplambda \frac{\partial {\mathcal {L}}}{\partial \mathsf {A}} = 4\uplambda \mathsf {A} \Upsigma _{i} (y_{i} - {\hat{y}}_{i}) \Upsigma _{j} (y_{j} - {\hat{y}}_{j}) \upkappa _{ij} {\mathbf {x}}_{ij}{\mathbf {x}}_{ij}^{\top },$$
(9)

with \({\mathbf {x}}_{ij}:={\mathbf {x}}_{i} - {\mathbf {x}}_{j}\). The matrix \(\mathsf {A}\) is gradually optimized to find the best embedding space. Therefore, we obtain a new embedding representation \(\mathbf{u} :=\mathsf {A}{\mathbf {x}}\) by linear transformation of the original vector \({\mathbf {x}}\). From the definition of \({\mathcal {L}}\), the function of the target property learned on \(\mathbf{u}\) is optimized to smoothly traverse through data points. Moreover, \(\mathbf{u}\) with its low dimension, 2D in our setting, is expected to be of benefit for both prediction estimators and human intuition regarding the Euclidean distance compared with the conventional \({\mathbf {x}}\) for 88 dimensions in the OFM representation.
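To make the role of the loss in Equation 8 concrete, the sketch below evaluates it for two hand-picked transformation matrices instead of running the iterative update. Several choices here are assumptions for illustration only: the prediction \({\hat{y}}_{i}\) is taken as a leave-one-out kernel-weighted average, the kernel is \(\exp (-||\mathsf {A}{\mathbf {x}}_{ij}||^{2})\) with the normalization constant dropped because it cancels in the average, and the two-feature toy data are synthetic.

```python
import numpy as np

def mlkr_loss(A, X, y):
    """Equation 8 with leave-one-out kernel regression predictions."""
    U = X @ A.T                                   # embedding u = A x
    diff = U[:, None, :] - U[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=-1))       # kernel on the squared distance of Equation 7
    np.fill_diagonal(K, 0.0)                      # exclude each point from its own prediction
    y_hat = K @ y / K.sum(axis=1)
    return float(np.sum((y - y_hat) ** 2))

# Synthetic data: feature 0 determines the target, feature 1 is an unrelated shuffle.
n = 30
x0 = np.linspace(0.0, 5.0, n)
x1 = (7 * np.arange(n) % n) / 6.0
X = np.column_stack([x0, x1])
y = x0.copy()

loss_informative = mlkr_loss(np.array([[1.0, 0.0]]), X, y)  # project onto feature 0
loss_unrelated = mlkr_loss(np.array([[0.0, 1.0]]), X, y)    # project onto feature 1
```

The projection onto the informative feature gives a much smaller loss because the target varies smoothly along that embedding coordinate. An optimizer that lowers the loss is therefore pushed toward transformations of this kind.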

Embedding function interpretation

Maximizing the prediction ability of the machine learning estimator using the most limited training data is the first priority of the active learning method. The process of querying new labeled data is equivalent to correcting the form of the learned function with respect to the target property. For example, in binary classification, querying data points with maximal variance of predicted class labels is equivalent to locating the boundary that separates the two observed classes. Therefore, as an additional advantage, following the correction process leads to better insight into the phenomena of interest. In this work, we introduce a method to localize information on the learned function and monitor its change, both to improve the data querying process and to interpret the phenomena of interest.

With the target property as a continuous variable, we consider the learned formation energy function \(\mathbf{y} =f(\mathbf{u} )\), which is called interpretable if it is possible to locate the regions of the representation space \(\mathbf{u}\) where the function meets a predefined condition g. In detail, given a condition g, the probability distribution spanning the embedding space \(\mathbf{u}\) is defined as follows:

$$p({\varvec{u}}|g) = \frac{1}{nh} \Upsigma _{i=1}^{n} {\mathbb{1}}[g({\varvec{u}}_{i})] {\text {e}}^{-\frac{|{\varvec{u}}_{i} - {\varvec{u}}|}{h}},$$
(10)

with \(p(\mathbf{u} |g)\) as the probability density at \(\mathbf{u}\) under g, \(\mathbf{u} _{i}\) as the location of an observed data point i (i.e., \(\mathbf{u} _{i}= \mathsf {A}{\mathbf {x}}_{i}\)), and h as a tunable kernel width. The indicator \({\mathbb{1}}[\cdot ]\) returns 1 if the condition \([\cdot ]\) is true, and 0 otherwise. In the present work, we consider two forms of relevant conditions.

(11)
(12)

where \({g_{y}}(\mathbf{u} _{i})\) and \(g_{ {\mathbf {x}}^{j}}(\mathbf{u} _{i})\) intuitively represent a region of interest with potentially stable materials and regions spanned by structures incorporating the nonzero OFM element \({\mathbf {x}}^{j}\). Then, we measure the Bhattacharyya coefficient58 between a pair of \((g_{y}, g_{{\mathbf {x}}^j})\) as

$${\text {BC}}(g_{y}, g_{{\mathbf {x}}^j}) = \int \sqrt{p(\mathbf{u} |g_{y}) p(\mathbf{u} |g_{{\mathbf {x}}^j})} \,{\text {d}}{} \mathbf{u},$$
(13)

with the integral taken over the space spanned by \(\mathbf{u}\). The Bhattacharyya coefficient \({\text {BC}}(g_{y}, g_{{\mathbf {x}}^j})\) measures the probability of joint occurrence between two conditions \(g_{y}\) and \(g_{{\mathbf {x}}^j}\). Higher BC values indicate a higher possibility of correlation between conditions \(g_{y}\) and \(g_{{\mathbf {x}}^j}\), and vice versa; in other words, the BC quantifies the overlap between the two distributions. In the discussion of the results provided in the “Results and discussion” section, we characterize any distribution \(p(\mathbf{u} |g)\) using a single-level contour representation.
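Equations 10 and 13 can be approximated numerically on a grid, as sketched below. All data here are hypothetical, and two simplifications are assumed: each density is renormalized so that it sums to one over the grid (replacing the \(1/nh\) prefactor), and the integral in Equation 13 becomes a sum over grid points. The stability condition uses the \(-0.1\) eV/atom threshold from the “Data set notation” section, and the feature condition uses a nonzero feature value; both are stand-ins for the two conditions.

```python
import numpy as np

def conditional_density(u, mask, grid, h):
    """Equation 10 on a grid, using only the points where the condition holds."""
    selected = u[mask]
    dist = np.linalg.norm(grid[:, None, :] - selected[None, :, :], axis=-1)
    density = np.exp(-dist / h).sum(axis=1)
    return density / density.sum()                # renormalize on the grid

def bhattacharyya(p, q):
    """Discrete form of Equation 13 for densities that each sum to one."""
    return float(np.sum(np.sqrt(p * q)))

# Hypothetical two-dimensional embedding with two well-separated clusters.
u = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [10.0, 10.0], [10.5, 10.0]])
y = np.array([-0.3, -0.2, -0.25, 0.1, 0.2])       # toy formation energies
feature_a = np.array([1.0, 2.0, 1.0, 0.0, 0.0])   # nonzero only in the first cluster
feature_b = np.array([0.0, 0.0, 0.0, 3.0, 1.0])   # nonzero only in the second cluster

axis = np.arange(-2.0, 12.5, 0.5)
grid = np.array([[a, b] for a in axis for b in axis])

p_stable = conditional_density(u, y < -0.1, grid, h=0.5)
bc_a = bhattacharyya(p_stable, conditional_density(u, feature_a != 0, grid, h=0.5))
bc_b = bhattacharyya(p_stable, conditional_density(u, feature_b != 0, grid, h=0.5))
```

In this toy setting the first feature occurs exactly where the low-energy points are, so its coefficient is essentially one, whereas the second feature lives in a distant region and its coefficient is close to zero.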

Acquisition function

The acquisition function \(\Upgamma ({\mathbf {x}})\) quantifies the reward of each structure in \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\) in terms of its contribution to the prediction accuracy of the estimation models and to the exploration process. Structures \({\mathbf {x}}^{*}\) are queried into \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\) to calculate their formation energy if their acquisition function values reach the optimal value:

$${\mathbf {x}}^{*} = {{\,{\text{arg\,max}\,}}}\Upgamma ({\mathbf {x}}) .$$
(14)

Most forms of \(\Upgamma\) are designed to determine the optimum of a fixed expensive-to-compute function. In this work, we examine the two most canonical functions as follows:

$$\begin{aligned} {\Upgamma }_{\text {exr}} ({\mathbf {x}})= \,& {} {{\varvec{\upsigma }}}({\mathbf {x}}), \end{aligned}$$
(15)
$$\begin{aligned} {\Upgamma }_{\text {exp}} ({\mathbf {x}})= & {} -{{\varvec{\upmu }}}({\mathbf {x}}), \end{aligned}$$
(16)

where \({{\varvec{\upmu }}}({\mathbf {x}})\) and \({{\varvec{\upsigma }}}({\mathbf {x}})\) are the mean and variance of the estimated values of a not-yet-calculated structure \({\mathbf {x}}\), respectively. Depending on the representation space on which the estimator operates, \({\mathbf {x}}\) is either an OFM vector or an embedding vector \(\mathbf{u} =\mathsf {A}{\mathbf {x}}\) learned by the metric learning method. The first acquisition function, \({\Upgamma }_{\text {exr}}\), based on the exploration strategy, assumes that not-yet-calculated structures with higher variance enhance the prediction ability of the estimator (i.e., are beneficial to the machine learning model). This acquisition function does not directly support finding superior structures because it does not include information on the absolute value of the target property. The second acquisition function, \({\Upgamma }_{\text {exp}}\), based on the exploitation strategy, selects not-yet-calculated structures with the lowest predicted target values as potential candidates to enhance the prediction ability of the estimator. Numerous acquisition functions31,59,60,61 have been introduced to balance the exploration and exploitation assumptions. Finally, we also examine an acquisition strategy \({\Upgamma }_{\text {uni}}\) that randomly selects from the pool of not-yet-calculated structures.
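The three querying strategies can be sketched as a single batch selector. The function name, the strategy labels, and the toy predictions are hypothetical; the selector simply ranks candidates by Equation 15 or Equation 16, or samples uniformly at random.

```python
import random

def select_batch(mean, variance, strategy, batch_size, seed=0):
    """Return indices of the structures to query under the given strategy."""
    indices = list(range(len(mean)))
    if strategy == "exploration":      # Equation 15: largest predicted variance first
        return sorted(indices, key=lambda i: variance[i], reverse=True)[:batch_size]
    if strategy == "exploitation":     # Equation 16: largest -mean, i.e., lowest predicted energy first
        return sorted(indices, key=lambda i: -mean[i], reverse=True)[:batch_size]
    if strategy == "random":           # uniform sampling from the pool
        return random.Random(seed).sample(indices, batch_size)
    raise ValueError(f"unknown strategy: {strategy}")

# Toy predictions for four not-yet-calculated structures (hypothetical values).
toy_mean = [-0.3, 0.1, -0.05, 0.2]
toy_variance = [0.01, 0.2, 0.05, 0.1]
by_exploration = select_batch(toy_mean, toy_variance, "exploration", 2)
by_exploitation = select_batch(toy_mean, toy_variance, "exploitation", 2)
by_random = select_batch(toy_mean, toy_variance, "random", 2)
```

In this toy example, exploration picks the two most uncertain candidates (indices 1 and 3), while exploitation picks the two with the lowest predicted formation energy (indices 0 and 2).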

Experiment and discussion

Experimental setup

We designed an experiment to simulate the process of exploring SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures with \(\mathsf {X}, \mathsf {Y}\) as Mo, Zn, Co, Cu, Ti, Al, and Ga using the proposed query-and-learn method. We collected the ternary compounds SmFe\(_{12-\upalpha }\mathsf {X}_{\upalpha }\) (\(\upalpha <4\)) to use as the initial training data and the quaternary compounds SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) (\(\upalpha +\upbeta <4\)) as the initial pool of not-yet-calculated data. Consequently, at the initial time of the exploration process, all not-yet-calculated structures were created using the bi-element substitution method, whereas all training structures were created using the single-element substitution method. We summarize the initial training structures in Figure 2, which shows the primary state of the training data \({\mathcal {D}}^{\mathsf {calculated}}\) with SmFe\(_{12-\upalpha }\mathsf {X}_{\upalpha }\) structures (\(\upalpha <4\)). In this figure, the structures are all referenced to SmFe\(_{12}\) values of formation energy (0.07 eV/atom) and magnetization (2.011 T/f.u.). Substituting Ti, Al, Co, and Ga regularly creates substituted structures with formation energies lower than the reference value of SmFe\(_{12}\). Among them, Ti and Al show a higher rate in creating negative formation energy structures than others. With Mo, Zn, and Cu, several substituted structures are more stable than the host SmFe\(_{12}\), whereas the others are not. Part of our calculations was found to be consistent with other first-principles calculation methods such as the Quantum MAterials Simulator (QMAS),62,63,64 OpenMX,20 or experimental results.13,65,66 Details of comparisons are shown in the Supplementary Information section.
Figure 2 in the Supplemental Information summarizes all SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures in the region of \(\upalpha +\upbeta <4\). All structures were described using 88-dimensional OFM vectors after eliminating duplicated columns.

Figure 2

Formation energy and magnetization of SmFe\(_{12-\upalpha }\mathsf {X}_{\upalpha }\) structures \((\upalpha < 4)\), used as an initial training set of active learning systems. The substances are denoted as Ga (pink), Mo (orange), Zn (yellow), Co (light blue), Cu (navy), Ti (maroon), and Al (green).

At each query time t, two batches of structures were selected, denoted by \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\) and \({\mathcal {D}}^{\mathsf {outstanding}}_{t}\). A detailed description of all batches is provided in the “Data set notation” section. We set the number of selected structures to 40 for each of \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\) and \({\mathcal {D}}^{\mathsf {outstanding}}_{t}\). In addition, to evaluate the performance of each strategy, we added 20 random structures to \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\). In total, 30 queries were required to collect all structures in the screening space.
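
The batch composition at a single query time can be sketched as follows. This is a minimal illustration under stated assumptions: the predicted values are synthetic, the helper `build_query_batches` is hypothetical, the beneficial batch is ranked by predictive variance (the exploration strategy), and a structure is allowed to appear in both batches, a detail the text does not specify.

```python
import random

N_BENEFICIAL = 40   # structures chosen by the acquisition function
N_OUTSTANDING = 40  # structures with the lowest predicted formation energy
N_RANDOM = 20       # extra random structures added to the beneficial batch

def build_query_batches(pool_ids, mu, sigma, rng):
    """Return (beneficial, outstanding) batches for one query time (sketch)."""
    # Beneficial batch: highest predictive variance first (assumed strategy).
    by_variance = sorted(pool_ids, key=lambda i: sigma[i], reverse=True)
    beneficial = by_variance[:N_BENEFICIAL]

    # Add random structures that are not already in the beneficial batch.
    chosen = set(beneficial)
    remaining = [i for i in pool_ids if i not in chosen]
    beneficial = beneficial + rng.sample(remaining, min(N_RANDOM, len(remaining)))

    # Outstanding batch: lowest predicted formation energy first.
    by_energy = sorted(pool_ids, key=lambda i: mu[i])
    outstanding = by_energy[:N_OUTSTANDING]
    return beneficial, outstanding

# Synthetic pool of 500 structures with made-up predictions.
rng = random.Random(0)
pool_ids = list(range(500))
mu = {i: rng.uniform(-0.2, 0.2) for i in pool_ids}
sigma = {i: rng.uniform(0.0, 0.05) for i in pool_ids}

beneficial, outstanding = build_query_batches(pool_ids, mu, sigma, rng)
```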

Results and discussion

Query-and-learn in monitoring the SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structure discovery process

We now present the proposed query-and-learn method designed to monitor the materials discovery process. We discuss the relative positions of not-yet-calculated, calculated, and queried structures; the form of the formation energy function; and generalized knowledge of the structure–stability mechanism of the SmFe\(_{12}\) family.

Figure 3 shows the learned embedding function regarding the formation energy of SmFe\(_{12}\) structures. In this figure, we show the results of a random querying strategy with the initial query \(t=1\) on the upper panel and the last query \(t=30\) on the lower panel. The results of other querying strategies are shown in the Supplemental Information. The calculated structures are denoted using face and edge colors, which indicate the portion of each substituted element. For each query time t, not-yet-calculated structures are shown as gray dots. White rhombus markers indicate structures that were queried at t in \({\mathcal {D}}^{\mathsf {beneficial}}_{t}\), and white triangle markers indicate the structures estimated to be the most potentially stable in \({\mathcal {D}}^{\mathsf {outstanding}}_{t}\). For each query time, the left and middle columns of Figure 3 show the predicted formation energy \(\hat{\mathbf{y }}\) and the estimated variance \(\upsigma (\hat{\mathbf{y }})\) derived from f, respectively. Moreover, the right column of Figure 3 shows the absolute prediction error \(|\mathbf{y} - \hat{\mathbf{y }}|\) between the ground-truth formation energy \(\mathbf{y}\) obtained from first-principles calculations and the formation energy \(\hat{\mathbf{y }}\) predicted by Gaussian process regression. At each t, we evaluated the error in predicting the formation energy for all not-yet-calculated structures in \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\).

Figure 3

Distribution of SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures on embedding spaces at the initial query \(t=1\) (upper panel) and the last query \(t=30\) (lower panel). The predicted formation energy \({\Delta E}_{\text {pred}}\) and its variance \(\upsigma ({\Delta E}_{\text {pred}})\) learned from the Gaussian process are shown as background colors in the first and second columns, respectively. The absolute errors in predicting the formation energies, \(|{\Delta E}_{\text {pred}} - {\Delta E}_{\text {DFT}}|\), are interpolated in the background of the third column. Not-yet-calculated structures are shown as gray dots.

The values of all these attributes, \(\hat{\mathbf{y }}, \upsigma (\hat{\mathbf{y }})\), and \(|\mathbf{y} - \hat{\mathbf{y }}|\), are shown as background colors obtained by nearest-neighbor interpolation in the embedding space. In this figure, the learned function \(\hat{\mathbf{y }}\) appears as a smooth function traversing all structures from the negative to the positive formation energy regions. Although the queried structures were well randomized and distributed throughout the entire structure space using \({\Upgamma }_{\text {uni}}\), our predicted potentially stable structures (white triangles) were still accurately located in the most negative formation energy region. Moreover, at \(t=1\), the not-yet-calculated structures created by bi-element substitution are uniformly dispersed among the calculated structures created by single-element substitution.

Next, we investigated the learned formation energy function on the embedding space via extremum interpretation. Figure 4 shows the formation energy landscape generated by the embedding representation at the first and the last query times. Because we aim to stabilize the SmFe\(_{12}\) structure, we define the region around the local minima of the formation energy function as our region of interest, which contains the most negative formation energy structures in \({\mathcal {D}}^{\mathsf {outstanding}}_{\mathsf {estimated}}\). This region is defined as the distribution spanned by \(p(\mathbf{u} |{g_{y}})\) in the “Embedding function interpretation” section. In Figure 4, the \(p(\mathbf{u} |{g_{y}})\) distributions are shown as red contours. In the following discussion, we refer to these distributions as the target contours for simplicity. By contrast, the distributions of structures with nonzero OFM features, defined as \(p(\mathbf{u} |g_{{\mathbf {x}}^j})\), are shown in the embedding space as other contour lines. Intuitively, a larger overlap between contours indicates a higher correlation between the corresponding properties. In the middle and right panels of Figure 4, we show the projected OFM features \(p(\mathbf{u} |g_{({\text {p}}^{1}, {M})})\) and \(p(\mathbf{u} |g_{({\text {d}}^{5}, {M})})\), respectively, with M as \({\text {s}}^{1}, {\text {s}}^{2}, {\text {p}}^{1}, {\text {d}}^{2}, {\text {d}}^{5}, {\text {d}}^{6}\), and \({\text {d}}^{7}\). The OFM features \(({\text {d}}^{5}, {M})\) give the average coordination number of sites with the M atomic representation surrounding Mo. Similarly, \(({\text {p}}^{1}, {M})\) gives the average coordination number of atoms with the M representation surrounding Al or Ga because these two elements share \({\text {p}}^{1}\) in their outermost-shell electron configuration.
At the last query time, \(t=30\), or equivalently after all structures have been calculated, the region of negative formation energy mostly overlaps with all \(({\text {p}}^{1}, {M})\) contours, that is, the regions spanning Al- and Ga-substituted structures. Among them, the \(({\text {p}}^{1}, {\text {d}}^{2})\) contour shows the distribution of SmFe\(_{12-\upalpha -\upbeta }\)[Al/Ga]\(_{\upalpha }\)Ti\(_{\upbeta }\) structures within the most negative formation energy region. After all structures were labeled, Figure 2 in the Supplemental Information shows that SmFe\(_{9}\)[Al/Ga]\(_{2}\)Ti structures have the most negative formation energy. By contrast, Mo-substituted structures lie far from the potentially stable regions. Notably, these correlations between the substituted element and the corresponding stability could already be found at the beginning of the querying process.

Figure 4

Distribution of SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures on the formation energy landscape learned by the embedding representation, at the initial (left) and the last (right) query times. Red contours indicate regions of negative formation energy structures, \(p(\mathbf{u} |g_{{\Delta E}_{\text {pred}}})\). The distributions of SmFe\(_{12-\upalpha -\upbeta }\)[Al/Ga]\(_{\upalpha }\mathsf {Y}_{\upbeta }\) structures, \(({\text{p}}^{1}, {\text{M}})\), and SmFe\(_{12-\upalpha -\upbeta }\)Mo\(_{\upalpha }\mathsf {Y}_{\upbeta }\) structures, \(({\text {d}}^{5}, {\text {M}})\), are shown as other contours. The positions of Al/Ga- and Mo-substituted structures relative to the potentially stable region are apparent from the beginning of the exploration process.

Figure 5 shows the dependence of the normalized \({\text {BC}}(g_{y}, g_{{\mathbf {x}}^j})\) on query time t in the active learning process for all OFM features. The OFM features are arranged as a matrix of blocks sharing the same center-atom representation; each block keeps the same order of neighbor representations. In Figure 5, structures with \(({\text {s}}^{1}, {\text {M}})\) and \(({\text {d}}^{5}, {\text {M}})\) features (i.e., Cu- and Mo-substituted structures) showed the lowest BC scores for all t. In other words, these structures were not located within the region of negative formation energy. By contrast, \({\text {BC}}(g_{y}, g_{({\text {p}}^{1}, {\text {M}})})\) always remained at the highest score; as we showed previously in the learned embedding space, these SmFe\(_{12-\upalpha -\upbeta }\)[Al/Ga]\(_{\upalpha }\mathsf {Y}_{\upbeta }\) structures were mostly associated with negative formation energy. As another example of interpreting the substitution effect, consider \({\text {BC}}(g_{y}, g_{({\text {d}}^{2}, {\text {M}})})\), which corresponds to Ti-substituted structures. Structures excluding \(({\text {d}}^{2}, {\text {s}}^{1})\) and \(({\text {d}}^{2}, {\text {d}}^{5})\), that is, all except SmFe\(_{12-\upalpha -\upbeta }\)[Mo/Cu]\(_{\upalpha }\)Ti\(_{\upbeta }\), showed a high possibility of negative formation energy. All these correlations were established by analyzing all queried data shown in the Supplementary Information. Interestingly, these correlations could be identified very early, even at the beginning of the exploration process. In summary, the BC score on a learned embedding space is potentially useful for understanding the form of the formation energy function and determining where interesting information is located without labeling all data.
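
To make the overlap score concrete, the sketch below computes a Bhattacharyya coefficient between two discrete distributions defined on a common set of bins, \({\text {BC}}(p, q) = \sum _i \sqrt{p_i q_i}\). It assumes that the distributions on the embedding space have already been discretized into histograms; the binning, the function names, and the counts are hypothetical and purely synthetic, and the normalization applied to the scores in this study is not reproduced here.

```python
import math

def normalize(weights):
    """Turn non-negative bin counts into a discrete probability distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]

def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i); 1 for identical, 0 for disjoint distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same bins")
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

# Synthetic histograms over five bins of a one-dimensional embedding coordinate.
target = normalize([8, 4, 1, 0, 0])        # stand-in for p(u | g_y)
feature_near = normalize([6, 5, 2, 0, 0])  # feature concentrated near the target
feature_far = normalize([0, 0, 1, 5, 7])   # feature concentrated far from it

bc_near = bhattacharyya_coefficient(target, feature_near)
bc_far = bhattacharyya_coefficient(target, feature_far)
```

A feature whose distribution overlaps the target region yields a coefficient close to 1, whereas a feature located elsewhere in the embedding space yields a coefficient close to 0.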

Figure 5

Time-dependent Bhattacharyya coefficients \({\text {BC}}(g_{y}, g_{{\mathbf {x}}^j})\) between the distribution of expected outstanding structures, \(p(\mathbf{u} |g_{y})\), and the distribution of structures having a nonzero \({\mathbf {x}}^j\) OFM element, \(p(\mathbf{u} |g_{{\mathbf {x}}^j})\). The lowest \({\text {BC}}(g_{y}, g_{({\text {s}}^{1}, {M})})\) values indicate that SmFe\(_{12-\upalpha -\upbeta }\)[Mo/Cu]\(_{\upalpha }\mathsf {Y}_{\upbeta }\) structures lie farther from the negative formation energy region. In contrast, high \({\text {BC}}(g_{y}, g_{({\text {p}}^{1}, {M})})\) values indicate that SmFe\(_{12-\upalpha -\upbeta }\)[Al/Ga]\(_{\upalpha }\mathsf {Y}_{\upbeta }\) structures have a high possibility of forming in nature. This information is found from the beginning of the structure query process.

Prediction ability of active learning designs

We examine the prediction accuracies of different active learning designs. For any query time t, we measured the mean absolute error (MAE) between the predicted and observed formation energies of structures in \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\). Because different structure querying strategies update their training data differently, not-yet-calculated structures in \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\) also differed among experiments. Therefore, MAE measured on \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\) can be approximated as the natural prediction loss of our designed system. Figure 6a shows the MAE results of active learning designs drawn from possible combinations of representation methods, estimators, and querying strategies. In this figure, with three acquisition functions, including uniform, exploration, and exploitation functions, experiments using the OFM representation are denoted in cyan, green, and blue, respectively. By contrast, active learning designs based on embedding representations are shown in yellow, orange, and red, respectively, with these three acquisition functions. Finally, we independently evaluate each of the six active learning designs ten times with different initial random structures in order to evaluate the prediction accuracies.
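
The evaluation described above can be sketched as follows: the MAE is computed only over structures that have not yet been calculated at the current query time. The dictionaries, structure identifiers, and energies are made up for illustration, and `mae_on_pool` is a hypothetical helper rather than part of the original code.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mae_on_pool(observed, predicted, calculated_ids):
    """MAE restricted to structures that are not yet in the training data."""
    pool = [sid for sid in observed if sid not in calculated_ids]
    return mean_absolute_error(
        [observed[sid] for sid in pool],
        [predicted[sid] for sid in pool],
    )

# Synthetic formation energies (eV/atom) keyed by hypothetical structure ids.
observed = {"s1": -0.10, "s2": 0.05, "s3": 0.02, "s4": -0.04}
predicted = {"s1": -0.08, "s2": 0.03, "s3": 0.05, "s4": -0.04}
calculated_ids = {"s4"}  # already queried, so excluded from the evaluation

mae = mae_on_pool(observed, predicted, calculated_ids)
```

Because each querying strategy removes different structures from the pool, the set over which this error is averaged differs among strategies at the same t.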

Figure 6

Learning rate of active learners on orbital-field matrix representation space (cyan, green, and blue) and embedded metric learning space (yellow, orange, and red) with different acquisition functions \(\upalpha\). Exploration, exploitation, and uniform acquisition strategies are denoted by \({\Upgamma }_{\text {exr}}, {\Upgamma }_{\text {exp}}\), and \({\Upgamma }_{\text {uni}}\), respectively. (a) Dependence of the mean absolute error in predicting structures in \({\mathcal {D}}^{\mathsf {\lnot calculated}}_{t}\) on querying time t. (b) Recall rate in querying the most potentially stable structure \({\mathcal {D}}^{\mathsf {outstanding}}_{t}\) depending on querying time t.

The difference between active learning designs primarily depended on the nature of the representation method. At \(t=1\), all active learning systems obtained an MAE of \(2.5\times 10^{-2}\) (eV/atom). Overall, MAEs gradually decreased with increasing t for all the active learning systems. However, the performance of designs with the high-dimensional OFM representation improved only gradually and linearly as newly queried structures were added. A possible explanation is that data points newly queried in this representation help the estimator predict only their neighbors rather than correct the estimator over the entire data space. The MAE curve with the highest fluctuation belonged to a system using the exploitation querying strategy. In other words, adding excessively biased data (e.g., low-energy structures), as in the exploitation strategy, misguided the prediction model in estimating other structures and directly reduced its prediction ability. By contrast, the lowest MAE curve always belonged to a design that utilized a uniform sampling strategy operating on the embedding representation. By querying up to \(t = 5\), that is, one-sixth of all not-yet-calculated structures, the design using the uniform querying strategy on the embedding space quickly reached the optimal MAE of \(1.25\times 10^{-2}\) (eV/atom) and then remained at this performance level for the remainder of the experiment. This model outperforms the others because the Mahalanobis metric learned using MLKR, which both preserves the Euclidean distance and follows the direction of the target function,57 helps to correct the form of the estimator locally and globally. Thus, given several queried data points uniformly sampled on the embedding space, we could improve these two aspects simultaneously.

Next, we evaluated active learning designs in recalling the most potentially stable structures. We heuristically defined \(-0.1\) (eV/atom) as the upper-limit formation energy for the set of most potentially stable structures. Consequently, the ground truth of the \({\mathcal {D}}^{\mathsf {outstanding}}_{\mathsf {confirmed}}\) set contained 74 structures with formation energy lower than \(-0.1\) (eV/atom), equivalent to 2.2% of all not-yet-calculated candidates. Figure 6b shows the recall rate results for all active learning designs. The colors and patterns denoting the different active learning designs are the same as those used for the MAE results in Figure 6a. This figure shows that all active learning designs recalled all \({\mathcal {D}}^{\mathsf {outstanding}}_{\mathsf {confirmed}}\) structures without querying all unlabeled structures. The design with the worst recall performance, which combined an exploration querying strategy with the OFM representation, required 14/30 query steps to recall all these potentially stable structures. By contrast, the methods with the best recall performance required 8/30 query steps. In the most naive case, in which we randomly select structures from the unlabeled data set and do not use any structures to update the active learning components, we would need to query all not-yet-calculated structures to recall all \({\mathcal {D}}^{\mathsf {outstanding}}_{\mathsf {confirmed}}\) structures. Equivalently, the rate of recall of the top 2.2% of structures with the lowest formation energy was enhanced between 2.1 and 3.7 times compared with the basic random selection method. We also report the results of using active learning with different initial training data in the Supplemental Information.
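
The recall rate and the reported enhancement factors can be illustrated with a short sketch. The interpretation of the enhancement as the ratio of total query steps to the steps needed for full recall is our reading of the numbers above (30/14 and 30/8), not a formula stated explicitly; the structure identifiers in the usage example are hypothetical.

```python
def recall_rate(queried_ids, outstanding_ids):
    """Fraction of the confirmed outstanding structures found among the queried ones."""
    outstanding = set(outstanding_ids)
    if not outstanding:
        raise ValueError("the outstanding set must not be empty")
    return len(outstanding & set(queried_ids)) / len(outstanding)

TOTAL_STEPS = 30  # query steps needed by random selection to cover the whole pool
WORST_STEPS = 14  # exploration strategy on the OFM representation
BEST_STEPS = 8    # best-performing designs

# Assumed reading: enhancement = total steps / steps needed for full recall.
speedup_worst = TOTAL_STEPS / WORST_STEPS  # roughly 2.1
speedup_best = TOTAL_STEPS / BEST_STEPS    # 3.75

# Toy usage: two of four outstanding structures have been queried so far.
toy_recall = recall_rate(queried_ids=[1, 2, 3], outstanding_ids=[2, 3, 4, 5])
```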

Structure–stability relationship

In this section, we discuss the structure–stability relationship of the SmFe\(_{12}\) family in detail. We investigate how different substituted elements distort the host structure by measuring the displacement of the OFM elements before and after a structure optimization step based on first-principles calculations. The displacement \(\Delta (\cdot )\) was measured as \(\Delta {\mathbf {x}} = {{\mathbf {x}}}_{\text {opt}} - {{\mathbf {x}}}_{\text {org}}\), where \({\mathbf {x}}_{\text {opt}}\) and \({{\mathbf {x}}}_{\text {org}}\) are the values of an OFM element of the optimized and initial structures, respectively. In Figure 7, we show correlations between formation energy and the displacement of OFM elements \(({\text {d}}^{6}, {M})\) in the upper panel and \(({\text {p}}^{1}, {M})\) in the lower panel, where M refers to \({\text {s}}^{2}, {\text {s}}^{1},\ldots\). Because \({\text {d}}^{6}\) represents the Fe element and \({\text {p}}^{1}\) represents the Al/Ga elements in the OFM, we focused on analyzing the change in the coordination number of the Fe sites and Al/Ga sites, respectively. Correlations between formation energy and other OFM elements are shown in the Supplementary Information. Here, violin plots in blue (yellow) show the displacement distributions of OFM elements whose mean displacement is negative (positive). By contrast, the formation energies of the corresponding structures are shown in red (green) when their mean is negative (positive).
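
The grouping behind this analysis can be sketched as follows: for the structures that share a given OFM element, compute \(\Delta {\mathbf {x}}\) per structure, then summarize the mean displacement and mean formation energy with the color convention described above. The helper `summarize_feature` and all numerical values are hypothetical, and treating a mean of exactly zero as positive is an arbitrary choice made for this sketch.

```python
from statistics import mean

def summarize_feature(x_opt_values, x_org_values, formation_energies):
    """Summarize one OFM element over the structures in which it is nonzero."""
    # Displacement of the OFM element after relaxation: delta = x_opt - x_org.
    deltas = [opt - org for opt, org in zip(x_opt_values, x_org_values)]
    mean_delta = mean(deltas)
    mean_energy = mean(formation_energies)
    return {
        "mean_delta": mean_delta,
        # Blue: coordination number decreases on average; yellow: it increases.
        "delta_color": "blue" if mean_delta < 0 else "yellow",
        "mean_energy": mean_energy,
        # Red: negative mean formation energy; green: positive mean.
        "energy_color": "red" if mean_energy < 0 else "green",
    }

# Synthetic values for three structures sharing one OFM element.
summary = summarize_feature(
    x_opt_values=[2.3, 2.1, 2.6],
    x_org_values=[2.0, 2.0, 2.5],
    formation_energies=[-0.12, -0.05, 0.01],
)
```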

Figure 7

Structure deformation–formation energy relationship of SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures. The displacement of an orbital-field matrix element \({\mathbf {x}}\) of SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures after geometrical relaxation (i.e., \(\Delta {\mathbf {x}} = {\mathbf {x}}_{\text {opt}} - {\mathbf {x}}_{\text {org}}\)) represents the change in coordination number. For structures with a given \({\mathbf {x}}\), distributions of \(\Delta {\mathbf {x}}\) with negative (positive) mean values are shown in blue (yellow), whereas distributions of formation energies with negative (positive) mean values are shown in red (green).

In the upper panel with \(({\text {d}}^{6}, {M})\), structures owning \(({\text {d}}^{6}, {\text {p}}^{1})\) and \(({\text {d}}^{6}, {\text {d}}^{2})\), that is, Al/Ga- and Ti-substituted structures, respectively, show on average negative formation energies, indicating a trend of potentially stable structures. Further, whereas the distribution of structures owning \(({\text {d}}^{6}, {\text {d}}^{2})\) shows on average a reduction in coordination number, the distribution of \(({\text {d}}^{6}, {\text {p}}^{1})\) structures has a positive mean value. As an interpretation, in SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\)-substituted structures, only Al/Ga-substituted sites come closer to Fe sites on average (i.e., increasing coordination number). In the lower panel with \(({\text {p}}^{1}, {\text {M}})\), we confirmed again that in all Al/Ga-substituted families, the coordination number of neighbors surrounding all \({\text {p}}^{1}\)-like OFM elements tends to increase (yellow violin distribution). Moreover, almost all structures with \(({\text {p}}^{1}, {M})\) exhibited a mean negative formation energy. By contrast, as shown in the Supplementary Information, structures with other OFM elements all showed decreasing trends of the average coordination number and mean positive formation energy, except for the \(({\text {d}}^{2}, {M})\) elements associated with Ti. The lowest mean value of formation energy belonged to \(({\text {p}}^{1}, {\text {d}}^{2})\) structures (i.e., the SmFe\(_{12-\upalpha -\upbeta }\)[Al/Ga]\(_{\upalpha }\)Ti\(_{\upbeta }\) family group).

Ideal structures in the SmFe\(_{12}\) family should meet one more requirement: maximizing the magnetization of the substituted structure. In the Supplemental Information, the most promising structures, which show optimal stability and magnetization, are a mix of Al-, Co-, and Cu-substituted structures. In Figure 8, we compare the non-optimized original SmFe\(_{12}\) structure with Al-, Co-, and Cu-substituted structures after the optimization process. Three structures, SmFe\(_{10}\)Al\(_{2}\), SmFe\(_{10}\)CoAl, and SmFe\(_{10}\)CuCo, have formation energies lower than that of SmFe\(_{12}\) and are sorted in order of increasing formation energy. Overall, these structures are smaller than the original SmFe\(_{12}\) structure, and the decreasing distance from the Fe-8f site to its neighbors reflects an increasing coordination number at this Fe site. In detail, the SmFe\(_{10}\)Al\(_{2}\) structure, with two Al-substituted sites, shows the largest shrinkage of the lattice parameters along the x- and y-axes while slightly expanding the lattice parameter along the z-axis. With one Al-substituted and one Co-substituted site, the SmFe\(_{10}\)CoAl structure has a smaller volume than the original but a slightly larger volume than SmFe\(_{10}\)Al\(_{2}\). The largest volume among these three substituted structures belongs to SmFe\(_{10}\)CuCo. In other words, Cu- and Co-substituted sites cannot distort other Fe and Sm sites. This evidence highlights the difference between the increased coordination number of Al-substituted structures and that of the others.

Figure 8

The original structure of SmFe\(_{12}\) and the typical Al-, Co-, and Cu-substituted structures after the optimization process. SmFe\(_{10}\)Al\(_{2}\), SmFe\(_{10}\)CoAl, and SmFe\(_{10}\)CuCo have lower formation energies and smaller unit cells than the original SmFe\(_{12}\). The distances from the Fe-8f site to other adjacent sites (in Å) are highlighted to show the distortion caused by the substitution.

Conclusion

In this study, we have introduced a query-and-learn active learning approach for exploring SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures with \(\mathsf {X}, \mathsf {Y}\) as Mo, Zn, Co, Cu, Ti, Al, Ga, and \(\upalpha +\upbeta <4\). Our proposed method was developed to accelerate the rate of discovery of potentially stable structures and generalize our understanding of the stability mechanism of this family. A total of 3307 SmFe\(_{12-\upalpha -\upbeta }\mathsf {X}_{\upalpha }\mathsf {Y}_{\upbeta }\) structures with formation energy and magnetism calculated using first-principles calculations were used to form the exploration space. By utilizing the embedded descriptor originating from the OFM, the MAE of active learning designs reached its lowest value of \(1.25\times 10^{-2}\) (eV/atom), which is 3.7\(\%\) of the range calculated from first-principles. Moreover, the design reached this irreducible error approximately six times faster than the alternatives compared. In the experiment aiming to find the most potentially stable structures, all active learning designs recalled the target structures 2.1–3.7 times faster than the random search strategy. Finally, we interpreted the formation energy landscape learned by the embedding representation via smooth correlations between distributions of the local extrema and different coordination number information. We discovered that structures with substitution of non-transition-metal elements such as Al and Ga, associated with Ti, in particular SmFe\(_{9}\)[Al/Ga]\(_{2}\)Ti, had the highest possibility of stabilizing the SmFe\(_{12}\) structure. Moreover, the SmFe\(_{12-\upalpha -\upbeta }\)[Al/Ga]\(_{\upalpha }\mathsf {Y}_{\upbeta }\) structures, which have a negative mean formation energy, exhibited on average an increasing number of neighbor atoms surrounding Al/Ga-substituted sites, whereas other families showed opposite trends.