The identification of physicochemically meaningful, descriptive parameters that are correlated with catalyst performance is a key step for modeling and understanding heterogeneous catalysis and finding new and more efficient catalytically active materials. These parameters, which characterize the materials and the processes triggering, facilitating or hindering the reaction, might be called the materials genes of heterogeneous catalysis. These catalyst genes can be used to construct maps of catalysts (i.e., materials charts) that highlight the small interesting regions of the (huge) space of all possible materials, where the search for high-performance catalysts should be focused.1,2 However, finding such descriptive parameters is challenging because the outcomes of interest (e.g., product selectivity) in reactions catalyzed by solids result from the concerted and intricate interplay of several processes. These are related to the material itself but also to the reaction conditions, for instance, the temperature and gas-phase in contact with the solid. Some of these processes are: multiple bond-breaking and -forming reactions occurring on the catalyst surface, the coverage of adsorbates on those surfaces, the catalyst restructuring in the reaction environment, referred to as the catalyst dynamics,3 and the diffusion of reactants and products within the porous structure of the catalyst.3,4

One approach for describing heterogeneous catalysis is the theoretical, multi-scale modeling by first-principles simulations.4,5,6 Nevertheless, the atomistic modeling of the full catalytic progression under realistic conditions is impractical because it requires computationally prohibitive methods for the accurate evaluation of large, interconnected networks of surface reactions7,8 and complex statistical-mechanical treatments of the catalyst dynamics.9 Additionally, mesoscale (e.g., adsorbate–adsorbate) and transport phenomena need to be taken into account as well. Finally, the coupling of all these phenomena, occurring at very different time and length scales is highly complex (see References 4, 5 and references therein). While experiments, for example spectroscopic studies under reaction conditions, can point to the specific processes governing the reactivity on the particular systems under investigation, it is not obvious how to derive general and quantitative relationships between materials physicochemical properties (and reaction conditions) and the catalyst performance that go beyond the classical Sabatier principle of optimal binding strength between reacting species and the catalyst.4,10,11

In this article, we demonstrate how a tailored artificial-intelligence (AI) approach, even when applied to only a small number of materials and materials functions, but billions of quantitative materials features, can determine the key physicochemical descriptive parameters characterizing the catalyst performance. This method is used to address the challenging propane selective oxidation reaction. We start from a consistent experimental set of “clean data” containing nine vanadium-based oxidation catalysts (Figure 1a). Here, the term “clean data” refers to the fact that these materials were carefully synthesized and tested in catalysis according to standardized protocols.12 Importantly, these nine catalysts were also characterized in detail, resulting in more than 40 measured properties per material. To this data set, we applied the compressed-sensing symbolic-regression sure-independence-screening-and-sparsifying-operator (SISSO)13,14 approach (Figure 1b), as implemented in the SISSO++ code.15 We thus identified the few most relevant parameters that are correlated, in a possibly complicated way, with the selectivity toward acrylic acid and with catalyst activity.

Figure 1
figure 1

(a) Vanadium-based selective oxidation catalysts used in this work. (b) Schematic workflow of the proposed approach combining clean experiments and artificial intelligence for the identification of “materials genes” of heterogeneous catalysis. Here, GHSV means “gas hourly space velocity,” and f(id) means “function of interpretable descriptors.” The “catalyst preparation” step consists in catalyst synthesis, calcining, pressing, and sieving.

Alkane selective oxidation

The selective oxidation reaction performed with mixed-metal-oxide heterogeneous catalysts enables the transformation of abundant light-alkanes (e.g., ethane, propane and n-butane) into the valuable products olefins and oxygenates.16 However, the initial alkane might undergo multiple reactions on the surface of the catalyst in the presence of oxygen molecules (O2),7,17 leading not only to the desired molecules, but also to several byproducts, including CO2. The chemical equations describing the formation of propylene (C3H6, olefin), acrylic acid (C3H4O2, oxygenate), and CO2 (combustion or total-oxidation product) in propane (C3) oxidation, for instance, are:

$$ 2{\text{C}}_{3} {\text{H}}_{{8\left( {\text{g}} \right)}} + {\text{ O}}_{{2\left( {\text{g}} \right)}} \to 2{\text{C}}_{3} {\text{H}}_{{6\left( {\text{g}} \right)}} + 2{\text{H}}_{2} {\text{O}}_{{\left( {\text{l}} \right)}} { }\left( { - \,162{\text{ kJ}}\,{\text{mol}}^{ - 1} } \right), $$
$$ {\text{C}}_{3} {\text{H}}_{{8\left( {\text{g}} \right)}} + 2{\text{O}}_{{2\left( {\text{g}} \right)}} \to {\text{C}}_{3} {\text{H}}_{4} {\text{O}}_{{2\left( {\text{l}} \right)}} + 2{\text{H}}_{2} {\text{O}}_{{\left( {\text{l}} \right)}} \left( { - \,852{\text{ kJ}}\,{\text{mol}}^{ - 1} } \right), $$
$$ {\text{C}}_{3} {\text{H}}_{{8\left( {\text{g}} \right)}} + 5{\text{ O}}_{{2\left( {\text{g}} \right)}} \to 3{\text{CO}}_{{2\left( {\text{g}} \right)}} + 4{\text{H}}_{2} {\text{O}}_{{\left( {\text{l}} \right)}} { }\left( { - \,2220{\text{ kJ}}\,{\text{mol}}^{ - 1} } \right){.} $$

Here, the values in parenthesis are the standard reaction enthalpies.18 Selectively forming the desired products, and, in particular the oxygenate, is therefore a challenge. The “seven pillars” of oxidation catalysis indicate the several factors contributing to reactivity in oxidation reactions: (1) lattice oxygen, (2) metal–oxygen bond strength, (3) host structure, (4) redox properties, (5) multifunctionality of active sites, (6) site isolation, and (7) phase cooperation.19,20 In the case of vanadium-based oxide catalysts, selectivity has been also related to surface enrichment of one of the metal ions in the presence of reaction feed containing steam21,22,23 and the associated surface potential barrier,21,24,25 highlighting that the catalyst dynamics also plays a role. Due to the multiple requirements and the intricacy of the underlying processes, the theoretical description of selective oxidation and the search for new catalysts is extremely challenging. Alternative approaches for modeling and designing new catalysts are thus required. Here, we propose a combination of standardized experiments and AI to address this problem.

Experimental handbooks for the generation of “clean data”

The identification of reactivity descriptors by AI relies on the consistency of the input data. Therefore, we developed standardized protocols for catalyst synthesis, characterization and testing, described in experimental handbooks,12 which enable the generation of consistent and annotated data, according to the FAIR principles (Findable, Accessible, Interoperable, and Re-purposable/Re-usable).1 The establishment of minimum requirements for performing and reporting measured reactivity is a crucial aspect in heterogeneous catalysis research. Because kinetic effects play a dominant role in catalysis, the reactivity is not only sensitive to the catalyst synthesis procedure and to the resulting as-synthesized catalyst, but also to the conditions to which the material is exposed prior to and during the reaction, for instance the temperature and the composition of the gas-phase (feed) in contact with the solid.2

In this work, we focus on nine common vanadium-based oxidation catalysts (Figure 1a). These materials were prepared, in a reproducible manner, in large batches (15–20 g) to guarantee that comprehensive catalyst characterization and testing is performed using samples from the same batch. The catalyst preparation consists on the catalyst synthesis itself, plus calcining, pressing, and sieving. The materials resulting from the catalyst preparation are called fresh catalysts. After catalyst preparation, all the catalysts were tested for the C3-oxidation reaction using a fixed-bed reactor (Figure 1b). The catalyst test starts with an activation procedure during which the synthesized materials are exposed to the reaction feed and rather high temperature (e.g., 450°C) for 48 h. The activation condition is that the conversion of either propane or oxygen is 85%. The materials resulting from the activation procedure are called activated catalysts. The goal of the activation procedure is to obtain samples as similar as possible to the catalytically active materials formed during the induction period of the reaction. Indeed, some catalysts undergo structural modifications during the activation procedure. For this reason, their properties differ significantly between fresh and activated catalyst samples (see data set provided in Electronic Supporting Information, ESI).

Following the activation step of the catalyst test, the temperature is brought to 225°C in lean air and gradually increased, in steps of 25°C, in the reaction feed up to 450°C, to enable the conversion of propane and oxygen. If the propane and/or the oxygen (molar) conversion exceeds 85%, the increase in temperature is stopped to prevent catalyst decomposition. At each of the temperatures, the steady-state operation is reached and the reaction mixture at the reactor outlet is collected and analyzed, providing the measures of catalytic performance. Catalyst activity and selectivity are evaluated in terms of propane conversion (\({X}_{\text{propane}}\)) and product selectivity (\({S}_{\text{product}}\)), respectively. The propane conversion indicates the molar fraction of oxidized propane (i.e., propane converted to any of the possible products). The selectivity indicates the molar fraction of a specific product with respect to all products formed from propane. The gas hourly space velocity (GHSV), the ratio between the volumetric flow and the catalyst volume, is kept constant for all catalysts during the test (at 1000 h−1) to ensure a consistent comparison among materials. After the temperature variation, the GHSV and feed are varied and the spent catalyst is further analyzed. These steps beyond temperature variation are not discussed in this article. The raw data used here are provided as ESI and the detailed experimental procedure is explained in the handbook.12

The performance of the nine catalysts in C3-oxidation, in terms of propane conversion and acrylic acid selectivity (Figure 2), shows a wide range of behaviors across the nine selected vanadium-based catalysts. These catalysts have different activity, that is, they react with propane in different amounts, as indicated by the different propane conversions profiles (Figure 2a). MoVOx is much more active than the other catalysts and converts 58.8% of the initial propane at 300°C. At higher temperatures, MoVTeNbOx, a-VPP, VPP and V2O5, achieve conversions higher than ca. 20%, with MoVTeNbOx being the most active catalyst among them. β-VOPO4, α-VWOPO4, α-VOPO4 and VWPOx achieve significantly lower conversions (below ca. 10%), even at the highest applied temperature. These catalysts are therefore the least active materials. Several products are formed from the initial propane on each catalyst, including the value-added acrylic acid and propylene, as well as the undesirable CO2. In our analysis, we focus on acrylic acid because the formation of this product involves a complex interplay of processes. The acrylic acid selectivity measured at each temperature of the catalyst test is shown in Figure 2b. Acrylic acid is a consecutive product of the propane oxidation reaction. It is formed after propane transformation to propylene, but before the total oxidation product CO2 (see Equations 13). For this reason, its formation is only observed at intermediate propane-conversion levels (Figure 2c). MoVTeNbOx is, by far, the most selective catalyst toward acrylic acid and achieves a maximum of 71.3% selectivity at 350°C at propane conversion of 26.1 percent. MoVOx, VPP, and a-VPP reach significant, although lower, acrylic acid selectivities (7.2, 18.6, and 14.2%, respectively). The maximum selectivity for MoVOx occurs at 225°C. For VPP and a-VPP the optimal temperatures with respect to selectivity are 400 and 350°C, respectively. This is in line with the fact that MoVOx is active at lower temperatures, while VPP and a-VPP require higher temperatures to convert propane (Figure 2a). Finally, V2O5 achieves a much lower acrylic acid selectivity of 2.3 percent. The catalysts β-VOPO4, α-VWOPO4, α-VOPO4, and VWPOx are unselective toward acrylic acid under the reaction conditions considered. The presence of such diverse scenarios in the data set is crucial for the success of the AI approach. The algorithm indeed needs to be informed about materials with different performance (in particular both desirable and undesirable types of behaviors) in order to identify the reactivity patterns we are searching for.

Figure 2
figure 2

(a) Propane conversion (\({X}_{\text{propane}}\)) and (b) acrylic acid selectivity (\({S}_{\text{acrylic acid}}\)) of the vanadium-based catalysts measured in the catalyst test, evidencing the diverse types of behavior in the data set. (c) \({S}_{\text{acrylic acid}}\) dependence on \({X}_{\text{propane}}\), showing that acrylic acid, as a consecutive product, is only formed at intermediate conversion levels. The MoVOx catalyst performance was not measured at temperatures above 300°C due to its limited thermal stability in the feed. Only the catalysts that produce acrylic acid are shown in (c).

To gather information characterizing the catalysts and reflecting the potentially relevant processes governing selective oxidation within our AI approach, we measured a wide range of bulk and surface properties of both fresh and activated catalyst samples. We used the following common characterization techniques: x-ray diffraction (XRD), N2 adsorption (ads.), x-ray fluorescence (XRF), laboratory x-ray photoelectron spectroscopy (lab-XPS), and temperature-programmed reduction/oxidation (TPRO). Additionally, we measured properties of the activated catalyst samples under the reaction conditions (temperature and gas-phase feed) by the advanced techniques near-ambient-pressure XPS (NAP-XPS) and microwave cavity perturbation technique (MCPT). These advanced techniques, referred to as in situ (as opposed to the ex situ common techniques previously discussed), provide properties of the “working catalyst,” which therefore take into account the catalyst dynamics. In particular, NAP-XPS, which provides surface properties, was carried out in three different feeds: dry, wet, and C3-rich. These conditions are used to probe the influence of surface composition and its electronic properties, which depends on the feed due to catalyst dynamics, on reactivity. Regarding MCPT, it is a technique for contactless determination of the conductivity, free of electrode effects.26

Finally, we note that microscopic as well as mesoscopic properties of the catalysts are included in our analysis, which can be related to phenomena at different length (and time) scales. For instance, the surface atomic composition (from lab-XPS and NAP-XPS) might characterize a molecular-level process whereas the pore volume (from N2 ads.) may be associated to the diffusion of reactants and products on the catalyst pores, a transport phenomenon occurring at a larger length scale. The characterization of these catalysts, also performed following standardized protocols (described in Reference 12) represents an unprecedented effort to acquire a consistent and detailed set of more than 40 catalyst bulk and surface properties. An overview of the measured properties is shown in Table I (the full data set is available in ESI).

Table I Catalyst properties and reaction parameters used as primary features for the SISSO analysis.

AI approach

The identification of correlations between materials and process properties on one side and the catalytic performance toward selective oxidation on the other was done by the SISSO approach.13,14 SISSO identifies descriptors in the form of typically complex, nonlinear analytical expressions depending on input parameters, called primary features. In the machine-learning nomenclature, these descriptors are representations. Thus, SISSO is also an efficient “representation-learning” algorithm. The SISSO approach starts with the collection of the primary features. These include all possibly relevant physicochemical parameters that may relate to the processes governing the catalysis question of interest. Concerning these parameters, it is better to offer many possibilities, and it does not matter if some of these “primary features” are correlated with others. Our choice of primary features for this work is given in Table I. They correspond to the measured materials properties as well as reaction parameters such as the temperature. Altogether, these are 50. In the second step, we construct the descriptor candidates. For this, the primary features are systematically combined using mathematical operators such as addition, multiplication, difference, etc. (see details in ESI). Thus, we follow a symbolic regression approach.27,28,29 This step results in the generation of billions of descriptor candidates. Each of them provides different numerical values for the different materials and/or processes. Thus, the big-data challenge is related to the intricacy of the underlying physics and chemistry. From the large number of candidate descriptors, and using the provided values of the targets properties for the materials in the data set, SISSO selects very few, typically just \(D\) = 1, or 2, or 3 best descriptor candidates, whose linear combination, with weighting coefficients, provides the best fit to the target property. \(D\) is referred to as the descriptor dimension. The final selected descriptor is thus the vector containing the selected candidate descriptors as individual components. The selection of descriptors and the identification of the coefficients is done by compressed sensing.30,31,32 The resulting models for the target property \(P\) have the form

$$ P^{{\left( {{\text{SISSO}}} \right)}} = c_{0} + \mathop \sum \limits_{i = 1}^{D} c_{i} d_{i} , $$

where \({d}_{i}\) are the descriptor components selected from the many billions of candidates, and \({c}_{i}\) are the fitting coefficients. Importantly, only few primary features, out of the 50 offered ones, appear in the finally selected descriptor. The SISSO-derived descriptors are interpretable in the sense that one can identify the key primary features by simply inspecting the output expressions. These primary features are the relevant “catalyst genes”.

Because the functional forms of the immensity of descriptor candidates offered to the SISSO analysis are very flexible, it is important to avoid overfitting (i.e., to avoid models that fit the provided data but are not generalizable). Two parameters control the model complexity: \(D\) (see Equation 4) and the number of times the mathematical operators are iteratively applied to the features (depth of symbolic-regression tree) in order to generate the descriptor candidates, called hereafter rung, and denoted by \(q\). The model complexity is assessed using leave-one-material-out cross-validation (CV). This CV procedure consists of training models with a data set in which one of the catalysts is removed, and then using the so-obtained ensemble of best models to predict the property of the left-out material. This procedure is iterated until all the catalysts are left-out once. The root mean squared errors (RMSEs) averaged over all CV iterations (averaged CV-RMSEs) are used as our performance metric. The optimal complexity is considered the one with the lowest CV-RMSE. Further details on the CV procedure are provided in ESI.

In this work, we use the multi-task version of SISSO (MT-SISSO).14 In the context of SISSO, multi-task refers to a transfer-learning approach for the identification of a single multi-dimensional descriptor for a target property across different material classes or external conditions, each of them corresponding to different fitting coefficients (\({c}_{i}\) in Equation 4). MT-SISSO thus provides a single descriptor for the property and different models for each class of materials or external condition. In the case of this work, such different external conditions correspond to the different reaction temperatures applied in the catalyst test. Predicting the target property at each of the measured temperatures is therefore a different task. We stress that, in addition to allowing for the simultaneous modeling of the catalytic performance at different temperatures, MT-SISSO also enables us to efficiently exploit the experimental data available, increasing the effective number of data points. This is because every material is measured in a large range of temperatures, providing multiple data points per material. The multi-task approach thus improves the reliability of the identified descriptors.

In spite of the application of MT-SISSO, the number of data points in our experimental data set is very small and by no means comparable to the amount of data needed for widely used machine-learning approaches such as kernel ridge regression or artificial neural networks. The latter typically requires > 103 data points. However, we stress that for the SISSO approach a large amount of physicochemical information about the considered materials is provided by the immense number of descriptor candidates considered, with their quantitative values. The AI strategy of SISSO enables the identification of descriptors that capture the intricate underlying processes without the need for a large amount of experimentally characterized materials. The big-data aspect is thus in the intricacy of the materials functions as the signal to be reconstructed (using the language of compressed sensing). It does not focus on the number of materials (or observations). We nevertheless point out that the more experimental data are available, the more generalizable the SISSO-derived models will be. Obviously, SISSO can only capture processes that are governing the target properties in the employed experimental data.

Identifying catalyst genes of the selective C3-oxidation

To identify descriptors indicating the catalyst performance in C3-oxidation, we use the acrylic acid selectivity (\({S}_{\text{acrylic acid}}\)) as target property. With the exception of MoVOx, which was only measured at four temperatures (vide supra), nine different temperatures in the range 225–425°C are considered per catalyst. Altogether, 76 data points are used. Even though the temperature is offered as a primary feature in our analysis and it can therefore be used to construct the descriptor expression, by using MT-SISSO we are also able to capture the effect of temperature via the coefficients used to fit the models (\({c}_{i}\) in Equation 4). This is because such coefficients are functions of the task, in this case the different temperatures. Indeed, SISSO captures the temperature effect only by the fitted coefficients. The best descriptor expressions identified do not contain the temperature as a parameter (vide infra).

The errors obtained when \({S}_{\text{acrylic acid}}\) is estimated using the MT-SISSO model trained on the whole data set (i.e., the training errors), decrease as the rung \(q\) and the dimension \(D\) increase (dashed lines in Figure 3). The training RMSEs are practically zero at \(D=3\) for the three considered \(q\). This evidences the flexibility of expressions selected by SISSO to fit the input data. The average CV-RMSEs (solid lines in Figure 3), however, do not decrease monotonically with rung and dimension. Instead, the average CV-RMSEs achieve a minimum value of 6.76% at \(q=3\), \(D=2\) with respect to an optimal predictability. This is therefore the identified appropriate complexity. The training error for such model is 1.46 percent. We note that a large fraction of the CV-RMSE is associated to the CV iteration in which the MoVTeNbOx catalyst is left out. Since this catalyst achieves a much higher \({S}_{\text{acrylic acid}}\) compared to the remaining ones in the data set (Figure 2b–c), it is probably dominated by a different process and therefore it is hard to correctly predict its performance based only on the remaining materials.

Figure 3
figure 3

Cross-validation (CV) analysis of models derived by MT-SISSO for the acrylic acid selectivity. The CV errors shown correspond to the averaged root mean squared error (RMSE) across leave-one-material-out-CV iterations. The optimal complexity is \(q=3\), \(D=2\). The secondary axis (on the right) shows the CV-RMSE as a fraction of the standard deviation of \({S}_{\text{acrylic acid}}\) over the whole data set. The CV-RMSE values shown here correspond to an ensemble size of 25 descriptors (see ESI for CV details).

The best descriptor identified by MT-SISSO (i.e., the descriptor identified using the whole data set at the optimal complexity) provided the following model:

$$ \begin{aligned} S_{{\text{acrylic acid}}}^{{\left( {{\text{SISSO}}} \right)}} \left( T \right) & = c_{1}^{S} \left( T \right)\left( {\left( {V_{{{\text{fr}}}}^{{{\text{pore}}}} } \right)^{2} \frac{{W_{{{\text{rxn}},{\text{wet}}}} }}{{E_{{{\text{A}},{\text{act}}}}^{{\upsigma }} }}\frac{1}{{\left( {x_{{{\text{s}},{\text{rxn}},{\text{C}}3}}^{{\text{V}}} - x_{{{\text{s}},{\text{act}}}}^{{\text{V}}} } \right)a_{{{\text{act}}}}^{{{\text{C}} - {\text{O}}}} x_{{{\text{s}},{\text{fr}}}}^{{\text{C}}} }}} \right) \\ & \quad + c_{2}^{S} \left( T \right)\left( {V_{{{\text{fr}}}}^{{{\text{pore}}}} V_{{{\text{act}}}}^{{{\text{pore}}}} \frac{1}{{E_{{{\text{A}},{\text{act}}}}^{{\upsigma }} }}\frac{{x_{{{\text{s}},{\text{rxn}},{\text{wet}}}}^{{\text{V}}} }}{{\left( {x_{{{\text{s}},{\text{rxn}},{\text{C}}3}}^{{\text{V}}} - x_{{{\text{s}},{\text{act}}}}^{{\text{V}}} } \right)\left( {x_{{{\text{s}},{\text{rxn}},{\text{dry}}}}^{{\text{V}}} + x_{{{\text{s}},{\text{fr}}}}^{{\text{C}}} } \right)}}} \right), \\ \end{aligned} $$

where the coefficients \({c}_{1}^{S}\left(T\right)\) and \({c}_{2}^{S}\left(T\right)\) depend on the measured temperature. In this expression, \({V}_{\text{fr}}^{\text{pore}}\) and \({V}_{\text{act}}^{\text{pore}}\) are the pore volumes of the fresh and activated catalysts, respectively, \({E}_{A\text{,act}}^{\sigma }\) is the activation energy of conductivity of the activated catalysts, \({W}_{\text{rxn},\text{wet}}\) is the work function of the catalysts under reaction in wet feed, \({x}_{\text{s},\text{act}}^{\text{V}}, {x}_{\text{s},\text{rxn},\text{dry}}^{\text{V}}\), \({x}_{\text{s},\text{rxn},\text{wet}}^{\text{V}}\), and \({x}_{\text{s},\text{rxn},\text{C}3}^{\text{V}}\) are the vanadium surface content of the activated catalysts and of the materials under reaction in dry, wet, and C3-rich feeds, respectively, \({x}_{\text{s},\text{fr}}^{\text{C}}\) is the carbon surface content of the fresh catalysts, and \({a}_{\text{act}}^{\text{C}-\text{O}}\) is the fraction of surface carbon assigned to C–O in the activated catalysts. Figure 4a shows the model derived by MT-SISSO for \({S}_{\text{acrylic acid}}\) (Equation 5) evaluated for the materials and temperatures measured in the catalyst test (crosses). The agreement indicates the good quality of the fit.

Figure 4
figure 4

Descriptors identified by MT-SISSO for the acrylic acid selectivity (\({S}_{\text{acrylic acid}}\left(T\right)\)). (a) Model expression evaluated on the nine vanadium-based catalysts of the data set at the measured temperatures (crosses), showing the quality of the fit with respect to experimental values (other markers). (b) Values of the best descriptor components for each catalyst. (c) Coefficients of the best model. The same markers and colors defining the materials in Figure 2 are used in (a). The points in (c) are connected by splines (2nd order).

The SISSO-identified primary features are thus \({V}_{\text{fr}}^{\text{pore}}\), \({V}_{\text{act}}^{\text{pore}}\), \({ E}_{\text{A},\text{act}}^{\upsigma }\), \({W}_{\text{rxn},\text{wet}}\),\({ x}_{\text{s},\text{act}}^{\text{V}}, {x}_{\text{s},\text{rxn},\text{dry}}^{\text{V}}\), \({x}_{\text{s},\text{rxn},\text{wet}}^{\text{V}}\), \({x}_{\text{s},\text{rxn},\text{C}3}^{\text{V}}\), \({x}_{\text{s},\text{fr}}^{\text{C}}\) and \({a}_{\text{act}}^{\text{C}-\text{O}}\). \({V}_{\text{fr}}^{\text{pore}}\) and \({V}_{\text{act}}^{\text{pore}}\) are associated to the porous structure of the catalyst and reflect processes related to the catalyst pores, for instance diffusion of reactants and/or products. \({E}_{\text{A},\text{act}}^{\upsigma }\) and \({W}_{\text{rxn},\text{wet}}\) correspond to the activation energy of charge carrier transport and to the electronic surface potential at reaction conditions, respectively. These primary features characterize electronic properties of the working catalysts, which can be related to the charge transfer from the catalyst to adsorbed reaction intermediates. \({x}_{\text{s},\text{act}}^{\text{V}}, {x}_{\text{s},\text{rxn},\text{dry}}^{\text{V}}\), \({x}_{\text{s},\text{rxn},\text{wet}}^{\text{V}}\), and \({x}_{\text{s},\text{rxn},\text{C}3}^{\text{V}}\) indicate the relevance of the concentration of the redox-active element, vanadium, at the surface of the catalysts. Finally, \({x}_{\text{s},\text{fr}}^{\text{C}}\) and \({a}_{\text{act}}^{\text{C}-\text{O}}\) are associated to the amount and specific types of surface carbon identified by XPS. They reflect the strength of adsorption on specific catalyst surface sites and are thus related to surface site-specific processes. Altogether, the identified catalyst genes reflect a concerted and intricate interplay of catalyst bulk and surface processes that governs the selectivity toward acrylic acid. These include the catalyst dynamics, described by properties measured in situ, as well as transport phenomena at a higher length scale, encoded by the catalyst pore volume.

The key primary features (genes) identified in the SISSO analysis are consistent with previous investigations of selective oxidation catalysis in vanadium-based materials. In particular, in both descriptor components the difference (\({x}_{\text{s},\text{rxn},\text{C}3}^{\text{V}}-{x}_{\text{s},\text{act}}^{\text{V}})\) appears, which could be linked to the observed vanadium surface enrichment occurring at reaction conditions in selective oxidation catalysts.21,22,23 We note that the precise mathematical expression and the primary features individually contain less physical meaning than their collective interplay, since descriptors obtained with different mathematical operators or different primary features—for instance, correlated with those shown in Equation 5—can capture the same underlying processes.

The model identified by MT-SISSO (Equation 5) is based on a two-dimensional (2D) descriptor with components \({d}_{1}^{S}\) and \({d}_{2}^{S}\). These are different constant values for each material, and they are weighted by temperature-dependent coefficients. The descriptor components (Figure 4b) assume non-negligible values for the catalysts that produce acrylic acid: MoVOx, MoVTeNbOx, V2O5, VPP, and a-VPP. Furthermore, their values are much higher for MoVTeNbOx compared to the other materials, in line with its much higher selectivity (Figure 2b). For the catalysts β-VOPO4, α-VWOPO4, α-VOPO4, and VWPOx, both \({d}_{1}^{S}\) and \({d}_{2}^{S}\) are practically zero.

The coefficients \({c}_{1}^{S}\left(T\right)\) and \({c}_{2}^{S}\left(T\right)\) (in black and green, respectively, in Figure 4c) of the acrylic acid selectivity take up positive and negative values depending on the temperature range. \({c}_{1}^{S}\left(T\right)\) and \({c}_{2}^{S}\left(T\right)\) are positive for high and low temperatures, respectively. The signs of the coefficients change at ca. 350°C. Therefore, the selectivity is described by the model of Equation 5 as a sum of a positive and a negative term. This hints at different processes that facilitate and hinder the selectivity in a concerted and temperature-dependent manner. By fitting smooth functions to the coefficients, models for estimating acrylic acid selectivity across temperatures, including those not measured experimentally, can be obtained. Such models are useful for optimizing process conditions.

In addition to the selectivity, we also identified descriptors for the efficiency of propane oxidation, indicated by the propane conversion (\({X}_{\text{propane}}\)). The SISSO-identified model, which corresponds to the optimal predictability of \(q=1\), \(D=2\) is

$$ X_{{{\text{propane}}}}^{{\left( {{\text{SISSO}}} \right)}} \left( T \right) = c_{1}^{X} \left( T \right)\left( {\frac{{u_{{{\text{m}},{\text{fr}}}}^{{\text{O}_{2} }} }}{{W_{{{\text{rxn}},{\text{wet}}}} }}} \right) + c_{1}^{X} \left( T \right)\left( {\text{u}_{{{\text{m}},{\text{fr}}}}^{{\text{O}_{2} }} x_{{{\text{s}},{\text{rxn}},{\text{C}}3}}^{{\text{O}}} } \right). $$

In this expression, \(\text{u}_{\text{m},\text{fr}}^{{\text{O}}_{2}}\) is the reversible oxygen uptake of the fresh catalysts per mass, and \({x}_{\text{s},\text{rxn},\text{C}3}^{\text{O}}\) is the O surface content under reaction in C3-rich feed. \(\text{u}_{\text{s},\text{fr}}^{{\text{O}}_{2}}\) indicates the materials’ ability to reversibly incorporate oxygen on its bulk structure and is related, for instance, to the role of lattice oxygen. \({W}_{\text{rxn},\text{wet}}\) and \({x}_{\text{s},\text{rxn},\text{C}3}^{\text{O}}\), in turn, are related to surface processes. While the \({c}_{1}^{X}\left(T\right)\) in the model of Equation 6 is positive for all considered temperatures, \({c}_{2}^{X}\left(T\right)\) is always negative (see Figure S3C). This indicates that the processes captured by the first term in Equation 6 facilitate propane conversion whereas those associated to the second term hinder it.

Even though the 2D descriptor in Equation 6 does reflect an interplay of processes governing activity, the descriptor complexity (\(q=1\)) is lower compared to the case of acrylic acid selectivity (Equation 5, \(q=3\)). Indeed, it is expected that the selectivity toward the oxygenate depends on a more intricate interplay of processes compared to the propane conversion to any product (i.e., including CO2). We note that \({W}_{\text{rxn},\text{wet}}\) is identified as a key parameter for both properties, consistent with the fact that the acrylic acid selectivity and the propane conversion are related (Figure 2c) and might display some common governing process—and thus common materials genes.

By inspecting SISSO models predicting \({S}_{\text{acrylic acid}}\) and \({X}_{\text{propane}}\), we observe that the descriptor components \({d}_{1}\) and \({d}_{2}\) have large mutual linear correlation for the materials in the training data set. This can be formalized by noticing that a linear model \({{d}_{2}=\upalpha d}_{1}+\upbeta \) yields a good approximation of \({d}_{2}\) when \({d}_{1}\) is known. In other words, all the so far known materials lie close to a straight line in the (\({d}_{1}{,d}_{2}\)) space. Since \({d}_{1}\) and \({d}_{2}\) depend on primary features that are measured in experiments for actual materials, it is unclear if materials that would land away from the \({{d}_{2}=\upalpha d}_{1}+\upbeta \) line actually exist. However, both models are expected to become less reliable the further a new tested material lands from the \({{d}_{2}=\upalpha d}_{1}+\upbeta \) line. Should this happen for a new tested material, the model would need retraining as a more complex model is likely needed. This can also be realized when noticing that the linear models \({X}_{\text{propane}},{S}_{\text{acrylic acid}}={c}_{1}{d}_{1}+{c}_{2}{d}_{2}\) can predict values of \({X}_{\text{propane}}\) and \({S}_{\text{acrylic acid}}\) outside the physically meaningful interval 0–100%, for arbitrary values of \({d}_{1}\) and \({d}_{2}\), which are different from the (\({d}_{1}{,d}_{2}\)) values that represent the materials in the data set. We also notice that for \({X}_{\text{propane}}\), the model is particularly sensitive when \({d}_{2}\) departs from the \({d}_{2}^{X}={\upalpha }^{X} {d}_{1}^{X}+{\upbeta }^{X}\) line, which limits its applicability. A deeper analysis of this model will be published elsewhere.

Maps of catalysts for guiding the design of new materials

We used the relationship \({d}_{2}^{S}={\upalpha }^{S} {d}_{1}^{S}+{\upbeta }^{S}\) to obtain a “map of catalysts” (Figure 5) showing \({S}_{\text{acrylic acid}}^{\left(\text{SISSO}\right)}\left(T\right)\) as a function of two variables, the materials descriptor \({d}_{1}^{S}\) and the temperature. In the expression \({d}_{2}^{S}={\upalpha }^{S} {d}_{1}^{S}+{\upbeta }^{S}\), \({\upalpha }^{S}\) and \({\upbeta }^{S}\) are fitted parameters, with values 2.948 × 10–3 eV−1 and − 9.246 × 10–4 g2 mol cm−6 kJ−1, respectively. The resulting map shows the selectivity, as a color scale from 0 to 100%, for the temperature range used in the experiment. Every material is represented, in this plot, by a horizontal line and the black lines indicate the materials in the experimental data set that produce acrylic acid and were used for training the model.

Figure 5
figure 5

Map of catalysts given by the MT-SISSO model for the acrylic acid selectivity \({S}_{\text{acrylic acid}}^{\left(\text{SISSO}\right)}\left(T\right)\), indicating the regions of the materials space corresponding to high selectivity (in blue). The materials used for deriving the descriptor that produce acrylic acid are indicated by the black lines. The green dashed line indicates a hypothetical VPPmodified catalyst. VPPmodified would be obtained by increasing 50% VPP pore volumes (in both fresh and activated materials) and by decreasing 50% of its activation energy of conductivity. The splines shown in Figure 4c are used to interpolate \({c}_{1}^{S}\left(T\right)\) and \({c}_{2}^{S}\left(T\right)\) across temperatures.

The map of Figure 5 highlights the different types of behavior present in the data set used for the derivation of the descriptor. In particular, it shows that the temperature of maximum acrylic acid selectivity decreases as one moves from low to high \({d}_{1}^{S}\) values. The map also evidences the unique and higher performance of MoVTeNbOx compared to the other materials. Additionally, this materials chart can accelerate the design of new catalysts, since it indicates the regions of the materials space where high-performance materials are found. In particular, catalysts with high \({d}_{1}^{S}\) values (regions shown in blue in Figure 5) are associated to high selectivity toward the formation of acrylic acid. From Equation 5, the pore volumes and the activation energies of conductivity are key properties that impact the value of \({d}_{1}^{S}\). By increasing pore volume and decreasing activation energy of conductivity, for instance, \({d}_{1}^{S}\) is increased. To illustrate how materials with “better” parameters will exhibit a better selectivity, we imagine a hypothetical catalyst, which would be obtained from VPP, increasing its pore volume by 50% (in both fresh and activated materials) and decreasing its activation energy of conductivity by 50 percent. The VPPmodified material, shown as the green dashed line in Figure 5, would provide an acrylic acid selectivity ca. three times higher than that for VPP. For this gedankenexperiment, we assume that all other primary features of the catalyst remain unchanged.

Finally, we note that the models derived by MT-SISSO are expected to hold for catalysts obtained as modification of the vanadium-based catalysts as well for other (new) materials as long as the reactivity is governed by the same processes governing the catalytic performance in the nine materials of our data set. However, for regions of the materials space containing catalysts different from those in the data set, the MT-SISSO model might need to be retrained with more data. This calls for a systematic experimental exploration of the materials space, i.e., systematically adding information at \({d}_{1}\), \({d}_{2}\) regions where experimental data are missing at present.

In this work, the primary features to model the measured catalytic performance via AI were entirely derived from experiment. As previously discussed, the explicit first-principles modeling of the full catalytic progression in complex reactions such as propane oxidation is unfeasible. However, in addition to the experimental inputs, primary features derived from theory could be included in the AI description of the measured reactivity. This will help to identify and understand the microscopic processes that govern catalysis. Such study will be addressed in a follow-up work.


Our study shows how consistent data in heterogeneous-catalysis research, generated according to standardized protocols for performing and annotating experiments,12 enable the identification of the key descriptive parameters related to catalyst performance, the “materials genes of catalysis,” by AI. Nine vanadium-based alkane oxidation catalysts presenting diverse reactivity toward C3-oxidation were synthesized, characterized, and tested according to such procedures. In particular, their detailed characterization resulted in more than 40 measured properties per material. To such data set, presenting a small number of materials, but a large amount of information for each catalyst, provided in terms of input features, we applied the compressed-sensing symbolic-regression SISSO approach. Out of billions of descriptor candidates, we found nontrivial interpretable expressions reflecting the concerted interplay of processes that govern catalysis, including the crucial catalyst dynamics. The AI-identified descriptors enable us to generate maps of catalysts for guiding the search of novel materials and rationalizing the reactivity trends. In particular, the key catalysts properties related to acrylic acid selectivity include the pore volume, the activation energy of conductivity, the work function, the fraction of surface carbon species assigned to carbon–oxygen as well as the vanadium and carbon surface contents. These properties, measured by N2 adsorption, in situ MCPT and XPS (including in situ NAP-XPS), are thus the key ones to be measured and used for the design of selective materials.

The combination of systematic experiments and AI proposed here is suitable for improved materials discovery and the modeling of complex materials properties and functions whose underlying governing processes are intricate and hard to model explicitly by atomistic simulations.

Supplementary material

SISSO and CV details as well as additional results for the descriptor \({X}_{\text{propane}}^{\left(\text{SISSO}\right)}\left(T\right)\) are available in ESI. The data generated and analyzed during this study are included as supplementary information files.

The SISSO analysis described in this publication can be found in a Jupyter notebook at the NOMAD Artificial-Intelligence Toolkit (, where it can be repeated and modified directly in a web browser.