Introduction

Computer-aided synthesis planning (CASP) [1], which aims to assist chemists in synthesizing new molecular compounds, has been rapidly transformed by artificial intelligence methods. Given the availability of large-scale reaction datasets, such as the United States Patent and Trademark Office (USPTO) [2], Reaxys [3], and SciFinder [4], CASP has become an increasingly popular topic in pharmaceutical discovery and organic chemistry, with many impressive breakthroughs achieved [5]. Current CASP systems address two critical aspects: retrosynthetic planning and forward-reaction prediction [6]. Retrosynthetic planning, including template-based and template-free methods, helps generate possible synthetic routes for target molecules [7]. Forward-reaction prediction is mainly used to evaluate the strategies proposed by retrosynthetic planning and to increase the likelihood of experimental success [8]. However, without considering reaction yield or reaction conditions, the synthetic strategies proposed by CASP systems would be difficult to implement. Predicting reaction yield remains a major challenge. Due to the complexity of chemical experiments, few solid theories can help predict the yield of a new chemical reaction under a specific condition, let alone optimize a reaction condition, which heavily depends on expertise, knowledge, intuition, numerous practice runs, extensive literature reading and even the luck of chemists [5, 9].

Some pioneering efforts have been made to predict reaction yield and then to find the optimal reaction condition. Note that the optimal reaction selection problem can be naturally treated as a classical out-of-distribution (OOD) problem, since the optimal reaction is often not included in the training set. Ahneman et al. [10] reported that the random forest model achieved the best performance on OOD yield prediction due to its good generalization ability. Zuranski et al. [11] reviewed and examined the OOD performance of different machine learning algorithms and reaction embedding techniques. Dong et al. [12] used the XGBoost model and achieved satisfactory OOD performance. Zhu et al. [13] demonstrated that regression-based machine learning had great application potential in OOD yield prediction.

Studying the experimental results of previous work, we found that prediction performance deteriorates dramatically when there is a relatively large difference between the training and testing data. For instance, in the experiments of [10, 14, 15], when the testing data do not contain any reagents absent from the training set (the testing data are randomly selected from the whole dataset, and the rest of the data is used as the training set), the \(R^{2}\) of the random forest model is 0.92. When the testing data include additives not contained in the training data (the testing data include reactions with some additives, while the training data include reactions with other additives), the \(R^{2}\) of the random forest model drops to 0.19 in the worst case (with a training set of almost the same size). This performance deterioration is very common when a yield prediction model is used to explore new reaction chemical space, as the unknown chemical space to be predicted can be huge. Enlarging the training set with a huge amount of data might solve this problem, but it is not practical due to the high cost of experimental data and the huge size of the unknown chemical space.

In this paper, we follow a more relaxed but practical setting, where we are allowed to add a few data points for new reagents or conditions into the training set. Considering the limited amount of reaction condition data, few-shot yield prediction has great potential for solving this problem. Few-shot yield prediction adds very few reaction samples (e.g., around five) from new reagents or conditions to the training data. It is reasonable to hypothesize that using data of a new reagent can improve prediction results. Questions yet to be explored are how to use these new samples, which samples to select, and how much data from the new reagent leads to a satisfactory result.

To bridge this gap, we propose MetaRF, an attention-based random forest model in which a meta-learning technique is applied to determine the attention weights adaptively. The random forest has proved to be an ensemble method with outstanding performance on datasets with small sample sizes [16, 17]. Since reaction condition datasets are relatively small (e.g. 781 reactions in the Buchwald-Hartwig electronic laboratory notebooks dataset [18]), random forest models have shown excellent performance on the reaction yield prediction task and have outperformed other machine learning approaches [10, 11, 19]. Few-shot learning techniques, such as meta-learning, have great potential in helping chemists explore new reaction chemical space. However, the structure of the random forest is non-differentiable, which makes it hard to combine with the gradient-based techniques used in few-shot learning; thus the random forest cannot be directly optimized by few-shot learning techniques. To solve this problem and achieve robust performance on new reagents, we design an attention-based random forest, adding attention weights to the random forest through a meta-learning framework, the Model-Agnostic Meta-Learning (MAML) algorithm [20]. The key idea of MAML is to train the model’s initial parameters so that the model can quickly adapt to a new task after the parameters have been updated through a few gradient steps computed with few-shot data from that new task [20]. MAML is applied to determine the attention weights of the decision trees in the random forest so that the model can quickly adapt to predict the performance of new reagents using few-shot training samples. In our method, the Density Functional Theory (DFT) descriptor [10] is used to represent molecules due to its enhanced interpretability and feature generalization ability in the yield prediction task.

Besides, the choice of few-shot training samples also has a significant influence on model performance. Few-shot learning can achieve better prediction performance if it is allowed to choose the training samples [21]. To tackle this challenge, we use the Kennard-Stone (KS) algorithm [22] to select the most representative samples, which cover the experimental space homogeneously. Since the KS algorithm is based on Euclidean distance, which suffers from the curse of dimensionality [23], T-distributed stochastic neighbor embedding (TSNE) [24] is applied for unsupervised nonlinear dimension reduction.

Our methodology is comprehensively evaluated on the Buchwald-Hartwig high-throughput experimentation (HTE) dataset [10], the Buchwald-Hartwig electronic laboratory notebooks (ELN) dataset [18], and the Suzuki-Miyaura HTE dataset [25]. On the Buchwald-Hartwig HTE dataset, our method achieves \(R^{2}\)=0.648 using 2.5% of the dataset as the training set; to reach a comparable result, the baseline method (random forest) needs at least 20% of the dataset as the training set. With the help of 5 additional samples, our method can effectively explore unseen chemical space and select high-yield reactions. The 10 reactions predicted to have the highest yield reach an average yield of 93.7%, relatively close to the result of ideal yield selection (95.5%). In contrast, the top 10 high-yield reactions selected by the baseline method have an average yield of 86.3%, and the average yield of random selection is 52.1%.

The overall framework of this research is presented in Fig. 1; more details of the methodology are given in Section-Methods. The methodology in this paper can predict the effect of a new reagent structure from only a few reaction data points, and our sampling method can help chemists choose the order of experiments.

Fig. 1

Workflow of this research, which includes reaction encoding, the dimension-reduction based sampling method, and the attention-based random forest model. The Buchwald-Hartwig HTE dataset is taken as an example

Methods

Reaction encoded with DFT

Density Functional Theory (DFT) descriptors are widely used for molecular embedding owing to their strong and effective feature generalization ability [26]. Previous research [11] shows that DFT descriptors provide transferable chemical insight and shed light on the underlying mechanism. Compared with molecular fingerprints and various learned representations, DFT descriptors are more closely associated with the physical and chemical attributes of molecules, thus providing enhanced interpretability and mechanistic understanding [27]. Using DFT descriptors, chemists can draw insights about the feature importance of each atom and each functional group. Previous experimental comparisons [19] also show that DFT descriptors outperform RDKit’s (a cheminformatics toolkit) chemical reaction fingerprints [28] and the deep-learning RXNFP method [29] on the yield prediction task. Thus we use DFT descriptors to represent molecules in our experiments.

We followed the DFT descriptor calculation in [10], which includes molecular, atomic, and vibrational property descriptors. As in [10], we generate the numerical encoding of each reaction by concatenating the DFT descriptors of its chemical components. For example, the encoding of experiment i in the Buchwald-Hartwig reaction is

$$\begin{aligned} {{x}_{i}}={{x}_{\mathrm{Aryl\ halide}}}\oplus {{x}_{\mathrm{Pd\ catalyst}}}\oplus {{x}_{\mathrm{Additive}}}\oplus {{x}_{\mathrm{Base}}} \end{aligned}$$
(1)

where \(\oplus\) denotes concatenation and \({{x}_{\mathrm{Aryl\ halide}}}\), \({{x}_{\mathrm{Pd\ catalyst}}}\), \({{x}_{\mathrm{Additive}}}\), \({{x}_{\mathrm{Base}}}\) denote the DFT descriptor vectors of the corresponding aryl halide, Pd catalyst, additive and base.
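As a minimal illustration of Eq. (1), the reaction encoding is simply a concatenation of pre-computed component descriptors. The descriptor values and dimensions below are placeholders, not the actual DFT features of [10]:

```python
import numpy as np

# Hypothetical pre-computed DFT descriptor vectors for each reaction component.
# The real descriptors in [10] include molecular, atomic, and vibrational properties.
x_aryl_halide = np.array([1.23, -0.45, 7.80])   # placeholder values
x_pd_catalyst = np.array([0.67, 2.10])
x_additive    = np.array([-1.05, 0.33, 0.90])
x_base        = np.array([4.20])

# Reaction encoding: concatenation of the component descriptors (Eq. 1).
x_i = np.concatenate([x_aryl_halide, x_pd_catalyst, x_additive, x_base])
print(x_i.shape)  # (9,) -- one fixed-length feature vector per reaction
```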

MetaRF: attention-based random forest

Although the random forest is a robust algorithm for yield prediction, it remains a challenge to combine it with few-shot learning techniques for yield prediction of new reagents or conditions. Meta-learning produces a model that can quickly adapt to new tasks with few additional samples. The Model-Agnostic Meta-Learning (MAML) framework [20] is a well-known meta-learning approach combining simplicity and effectiveness. However, the non-differentiable nature of the random forest makes it difficult to integrate with a gradient-based meta-learning framework. To tackle this problem, we assign attention weights to the decision trees in the random forest and learn them with the MAML framework, which consists of a meta-training phase and a few-shot fine-tuning phase.

To evaluate OOD prediction ability, the testing set and validation set must include at least one reagent unseen in the training set. For example, among the 22 different additives in the Buchwald-Hartwig HTE dataset, 4 additives are used for training, 1 additive for validation, and 17 for testing. In this way, the training set and validation set account for 22.7% of the dataset. To further reduce the size of the training set, we use the sampling method described in Section-Dimension-reduction based Sampling Method. The random forest model is trained on the reduced training set.
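A minimal sketch of this leave-additives-out split, assuming the additive counts stated above; the mapping from reactions to additives is a placeholder:

```python
import numpy as np

# Assign whole additives (not individual reactions) to splits so that validation and
# testing additives are never seen during training.
additive_ids = np.arange(22)
rng = np.random.default_rng(0)
rng.shuffle(additive_ids)
train_additives = additive_ids[:4]    # 4 additives for training
val_additives   = additive_ids[4:5]   # 1 additive for validation
test_additives  = additive_ids[5:]    # 17 additives for testing

# Each reaction inherits the split of its additive, e.g.:
# train_mask = np.isin(reaction_additive_ids, train_additives)
```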

In the random forest model, forest \(\mathcal {F}\) is a collection of decision trees:

$$\begin{aligned} \mathcal {F}(\Theta )=\{{{h}_{m}}({\textbf{x}};{{\Theta }_{m}})\},m=1,2,\ldots M \end{aligned}$$
(2)

where M is the total number of decision trees and \(\Theta =\{{{\Theta }_{1}},{{\Theta }_{2}},\ldots ,{{\Theta }_{M}}\}\) represents the parameters of \(\mathcal {F}\), which include the splitting variables and their splitting values. \(\mathcal {F}\) is fitted on the training data \(\mathcal {L}=\left\{ ({{x}_{1}},{{y}_{1}}),\ldots ,({{x}_{N}},{{y}_{N}}) \right\}\), where \({{x}_{i}}\) is the embedding of reaction i (defined in the former section) and \({{y}_{i}}\) is the yield of the reaction.

The decision tree is a simple predictive model. It has the form

$$\begin{aligned} {{h}_{m}}(x)=\sum \nolimits _{j=1}^{J}{{{b}_{jm}}}I(x\in {{R}_{jm}}) \end{aligned}$$
(3)

where J is the number of its leaves. The tree partitions the input space into J disjoint regions \({{R}_{1m}},\ldots ,{{R}_{Jm}}\) and predicts a constant value in each region. \({{b}_{jm}}\) is the value predicted in \({{R}_{jm}}\).

At each tree node, a subset of the variables is randomly selected, and the splitting variable is chosen from this subset. This random selection of features at each node decreases the correlation between the trees in the forest and thus reduces the error rate of the random forest.

Concatenating the results of the decision trees \({{h}_{m}}(x)\), we have

$$\begin{aligned} {{x}_{i}}^{\prime }=\left[ \begin{array}{c} {{h}_{1}}({{x}_{i}}) \\ {{h}_{2}}({{x}_{i}}) \\ \vdots \\ {{h}_{M}}({{x}_{i}}) \end{array} \right] \end{aligned}$$
(4)

Then we assign attention weights to the per-tree results \({{x}_{i}}^{\prime }\). In this step the parameters inside the decision trees are not changed. The attention weight of each decision tree is updated through a meta-training phase and a few-shot fine-tuning phase.
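A minimal sketch of forming the per-tree prediction vector of Eq. (4), assuming scikit-learn's RandomForestRegressor and placeholder data; the uniform weighting at the end is only illustrative, since in MetaRF these weights are adapted by the MAML-trained regressor described below:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data standing in for the reduced DFT-encoded training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(90, 120))
y_train = rng.uniform(0, 100, size=90)

# Fit the forest; its internal parameters stay fixed in the following phases.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

def tree_outputs(forest, X):
    """Per-tree predictions: one column per decision tree, i.e. the vector x' of Eq. (4)."""
    return np.column_stack([tree.predict(X) for tree in forest.estimators_])

X_query = rng.normal(size=(5, 120))
H = tree_outputs(forest, X_query)              # shape (n_samples, M)

# Illustrative aggregation with uniform attention weights; MetaRF instead feeds H to the
# regressor f_theta whose parameters are meta-learned (next subsection).
attention = np.full(H.shape[1], 1.0 / H.shape[1])
y_pred = H @ attention
```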

Fig. 2

A Meta-training Phase. B Few-shot fine-tuning Phase. C Testing Phase

In the meta-training phase (illustrated in Fig. 2A), MAML provides a good initialization of the parameters of a deep network. Assume \(\theta\) denotes the parameters to be optimized and \({f}_{\theta }\) the parametrized function. In each training iteration, an updated \(\theta\) is computed using one gradient-descent update on task \({T}_{i}\), and the meta-loss is computed with this updated \(\theta\). Sampling a task \({T}_{i}\) includes two steps: an additive \(S_i\) is randomly sampled from the training additive set, and then K reactions with additive \(S_i\) are randomly sampled to form task \({T}_{i}\). More concretely, the meta-objective is:

$$\begin{aligned} \min _{\theta }\ \sum \limits _{{{T}_{i}}\sim T}{{{L}_{{{T}_{i}}}}\left( {{f}_{\theta -\alpha {{\nabla }_{\theta }} {{L}_{{{T}_{i}}}}({{f}_{\theta }})}}\right) } \end{aligned}$$
(5)

where \({{L}_{{{T}_{i}}}}\) is the mean squared error between the prediction \({f}_{\phi }({{x}_{j}}^{\prime })\) and the true value \({y}_{j}\) in task \({T}_{i}\):

$$\begin{aligned} {{L}_{{{T}_{i}}}}({{f}_{\phi }})=\sum \limits _{({{x}_{j}}^{\prime },{{y}_{j}})\sim {{T}_{i}}}{\left\| {{f}_{\phi }}({{x}_{j}}^{\prime })-{{y}_{j}} \right\| _{2}^{2}} \end{aligned}$$
(6)

\({{x}_{j}}^{\prime }\) represents the vector of decision-tree outputs defined in the former section. As in Finn et al. [20], the regressor \({f}_{\theta }\) is a neural network with 2 hidden layers of size 40 with ReLU nonlinearities. During training, Equation (5) is minimized with the gradient-descent algorithm Adam [30] to obtain the parameters \(\theta _{{\mathrm {meta-train}}}\).

In the few-shot fine-tuning phase (illustrated in Fig. 2B), the model is fine-tuned with a few samples from each testing additive. One gradient-descent step is performed to obtain parameters \(\theta _{{\mathrm {few - shot}}}\) suited to the new task \({T}_{test}\):

$$\begin{aligned} \theta _{{\mathrm {few - shot}}} = \theta _{{\mathrm {meta-train}}} -\alpha {{\nabla }_{\theta _{{\mathrm {meta-train}}}}} {{L}_{{{T}_{test}}}}({{f}_{\theta _{{\mathrm {meta-train}}}}}) \end{aligned}$$
(7)

For each additive in the testing set, the fine-tune samples in \({T}_{test}\) are selected using the dimension-reduction based sampling method in Section-Dimension-reduction based Sampling Method. The number of fine-tune samples is varied in our experiments.
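A compact PyTorch sketch of both phases is given below. It is an illustrative re-implementation under assumptions (M = 100 trees, one task per meta-iteration, placeholder data and outer learning rate), not the authors' released code; only K = 40, \(\alpha\) = 0.0001, Adam and 80 training iterations follow the settings reported in the Results section.

```python
import torch

M, K, alpha, n_iters = 100, 40, 1e-4, 80   # M is an assumed number of trees

def init_params():
    # 2 hidden layers of size 40 with ReLU, as stated in the text.
    sizes = [(M, 40), (40, 40), (40, 1)]
    params = []
    for n_in, n_out in sizes:
        params += [(torch.randn(n_in, n_out) * 0.01).requires_grad_(),
                   torch.zeros(n_out, requires_grad=True)]
    return params

def f(params, x):
    w1, b1, w2, b2, w3, b3 = params
    h = torch.relu(x @ w1 + b1)
    h = torch.relu(h @ w2 + b2)
    return (h @ w3 + b3).squeeze(-1)

def mse(params, x, y):
    return ((f(params, x) - y) ** 2).mean()

def sample_task():
    # Placeholder: in MetaRF an additive is drawn from the training additives and K reactions
    # with that additive are sampled (support for the inner step, query for the meta-loss).
    xs, ys = torch.randn(K, M), torch.rand(K) * 100
    xq, yq = torch.randn(K, M), torch.rand(K) * 100
    return xs, ys, xq, yq

# Meta-training phase (Eq. 5).
theta = init_params()
meta_opt = torch.optim.Adam(theta, lr=1e-3)   # outer learning rate is a placeholder
for _ in range(n_iters):
    xs, ys, xq, yq = sample_task()
    grads = torch.autograd.grad(mse(theta, xs, ys), theta, create_graph=True)
    theta_prime = [p - alpha * g for p, g in zip(theta, grads)]   # one inner gradient step
    meta_opt.zero_grad()
    mse(theta_prime, xq, yq).backward()                           # loss at adapted parameters
    meta_opt.step()

# Few-shot fine-tuning phase (Eq. 7): one gradient step on the few selected samples of a
# new (testing) additive, then prediction for its remaining reactions.
x_ft, y_ft = torch.randn(5, M), torch.rand(5) * 100
grads = torch.autograd.grad(mse(theta, x_ft, y_ft), theta)
theta_few_shot = [p - alpha * g for p, g in zip(theta, grads)]
y_hat = f(theta_few_shot, torch.randn(3, M))
```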

Dimension-reduction based sampling method

For the few-shot learning problem, the choice of the few-shot training samples has a significant influence on training performance. If we preferentially select the most representative samples as training samples, the performance of few-shot learning can be dramatically improved [31]. We use the Kennard-Stone (KS) algorithm [22] to select the most representative samples by repeatedly selecting the new sample with the largest distance to the previously selected samples. However, the KS algorithm uses Euclidean distance between samples, which is less effective for high-dimensional reaction data [32]. Thus we apply T-distributed stochastic neighbor embedding (TSNE) [24] before the KS algorithm to reduce the dimension of the reaction data. TSNE is a widely used unsupervised nonlinear dimension reduction technique owing to its advantage in capturing local data characteristics and revealing subtle data structures [24, 33, 34].

Figure 3 uses the “Swiss roll” dataset as an illustrative example of the effect of nonlinear dimension reduction [35]. Figure 3A shows that Euclidean distance in the high-dimensional input space may not reflect the true low-dimensional geometry of the manifold. Figure 3B shows the sampling result of the KS algorithm without nonlinear dimension reduction: because the KS algorithm is based on Euclidean distance, it does not sample the central area. Figure 3C shows the sampling result of the KS algorithm when the dimension of the data is reduced to two: after nonlinear dimension reduction, the KS algorithm does sample the central area. This example shows that nonlinear dimension reduction can help our sampling method explore the intrinsic geometry of the data.

Given a set of high-dimensional reaction embeddings \({{x}_{1}},{{x}_{2}},\ldots ,{{x}_{N}}\), TSNE maps the data to a low dimension while retaining the significant structure of the original data [24, 36]. It is based on probabilistic modeling of data points in the original space and the projection space [37].

Fig. 3

Example of nonlinear dimension reduction on the “Swiss roll” dataset. A Euclidean distance (dashed line) in the high-dimensional input space may not reflect the true low-dimensional geometry of the manifold (solid line). B Using the KS algorithm to select the most representative samples in the high-dimensional input space, samples in the central area (dashed square) are not selected. C Using the KS algorithm after nonlinear dimension reduction, samples in the central area (dashed square) are selected

The TSNE algorithm is based on the SNE framework [38], which converts high-dimensional Euclidean distances into conditional probabilities representing the similarity of every data pair. The gradient descent technique is typically used for optimization.
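For reference, the conditional similarity in the SNE framework takes the standard Gaussian form (as defined in [38]; \(\sigma _{i}\) denotes the bandwidth of the Gaussian centred on \({{x}_{i}}\)):

$$\begin{aligned} p_{j|i}=\frac{\exp \left( -\left\| {{x}_{i}}-{{x}_{j}} \right\| ^{2}/2\sigma _{i}^{2} \right) }{\sum \nolimits _{k\ne i}{\exp \left( -\left\| {{x}_{i}}-{{x}_{k}} \right\| ^{2}/2\sigma _{i}^{2} \right) }} \end{aligned}$$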

After the high-dimensional reaction embeddings \({{x}_{1}},{{x}_{2}},\ldots ,{{x}_{N}}\) are mapped to the low-dimensional data \({{z}_{1}},{{z}_{2}},\ldots ,{{z}_{N}}\), the Kennard-Stone (KS) algorithm is used to select the few-shot training samples in the low-dimensional space. The KS algorithm is a well-known method for selecting the most representative samples from a whole dataset [22, 39, 40]; it aims at choosing a subset of samples that covers the experimental space homogeneously [41]. First, the Euclidean distance between each pair of samples is calculated, and the pair of samples with the largest distance is chosen. Then the following samples are selected sequentially based on their distances to the already selected samples: the remaining sample whose minimum distance to the selected subset is largest is chosen and added to the subset. This procedure is repeated until the desired number of samples has been selected.
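A minimal sketch of this sampling pipeline, assuming scikit-learn's TSNE and placeholder reaction embeddings; the embedding dimension, perplexity and other settings are illustrative only:

```python
import numpy as np
from sklearn.manifold import TSNE

def kennard_stone(Z, n_samples):
    """Select n_samples representative points from Z with the Kennard-Stone algorithm."""
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)   # pairwise Euclidean distances
    # Start with the pair of points that are farthest apart.
    selected = list(np.unravel_index(np.argmax(dist), dist.shape))
    while len(selected) < n_samples:
        remaining = [i for i in range(len(Z)) if i not in selected]
        # For each remaining point, take its distance to the closest already-selected point,
        # then pick the point maximizing that distance (covers the space homogeneously).
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_dist))])
    return selected

# Placeholder high-dimensional reaction embeddings standing in for the DFT encodings.
X = np.random.default_rng(0).normal(size=(200, 120))

# Unsupervised nonlinear dimension reduction to 2-d before distance-based selection.
Z = TSNE(n_components=2, random_state=0).fit_transform(X)
fine_tune_idx = kennard_stone(Z, n_samples=5)
```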

From a chemical perspective, our dimension-reduction based sampling method can explore the intrinsic geometry of the chemical structures and properties contained in the DFT descriptors. The KS algorithm can distinguish these discrepancies and select representative samples with very different chemical structures and properties, which may shed light on the design of chemical experiments.

Fig. 4

A Buchwald-Hartwig HTE dataset. B Buchwald-Hartwig ELN dataset. C Suzuki-Miyaura HTE dataset

Results

Performance benchmarking

Fig. 5

Comparison of test set performance of MetaRF and baseline on three datasets. \(R^{2}\) performance increases gradually as the size of training data increases. MetaRF outperforms the baseline with markedly fewer training samples

Fig. 6

Average and standard deviation of the yield for the top 10 reactions predicted to have the highest yields

We evaluate our method on the Buchwald-Hartwig electronic laboratory notebooks (ELN) dataset [18], the Buchwald-Hartwig HTE dataset [10] and the Suzuki-Miyaura HTE dataset [25]. The Buchwald-Hartwig HTE dataset contains the HTE results of the Pd-catalysed Buchwald-Hartwig cross-coupling reaction. This dataset consists of 3955 reactions, as shown in Fig. 4A, and the reaction space is the combination of 15 aryl halides, 4 Buchwald ligands, 3 bases, and 22 isoxazole additives. The Buchwald-Hartwig ELN dataset is a real-world dataset disclosed from electronic laboratory notebooks (ELN) at AstraZeneca. It covers a large reaction space: 340 aryl halides, 260 amines, 24 ligands, 15 bases, and 15 solvents would give about \(4.7\times 10^{8}\) possible combinations, whereas the dataset includes only 781 reactions (Fig. 4B), resulting in rather sparse coverage. The HTE dataset greatly differs from the ELN dataset in the coverage of chemical space and its characteristics: the HTE dataset covers the entire search space of reaction conditions, while the ELN dataset sparsely covers a much wider chemical space. We also evaluate our methodology on the Suzuki-Miyaura HTE dataset [25] to show that it can be easily adapted to other reactions. In the Suzuki-Miyaura reaction, an aryl halide reacts with an organoboron compound to form a new C-C bond in the presence of a Pd catalyst, ligand, and base. The mechanism of the Suzuki-Miyaura reaction is close to that of the Buchwald-Hartwig reaction; both include an oxidative addition step and a reductive elimination step in the catalytic cycle. The dataset includes 15 pairs of electrophiles and nucleophiles (\(R_{1a-d}\) with \(R_{2a-c}\) and \(R_{1e-g}\) with \(R_{2d}\)), 12 ligands, 8 bases, and 4 solvents, resulting in 5760 reactions (Fig. 4C). Experiments show that our methodology possesses outstanding performance on few-shot yield prediction.

Fig. 7

The prediction results on different additives. For each additive in the testing set, the predicted yield and observed yield are presented in a subplot. The title of each subplot is the index number of the additive

Fig. 8

Most important DFT descriptors of the model trained on different sizes of data. Feature importance is determined by the decrease in \(R^{2}\) upon reshuffling the values of the feature. * indicates a shared atom. E indicates energy; HOMO indicates the highest occupied molecular orbital; V indicates vibration

Table 1 Method comparison results on Buchwald-Hartwig HTE dataset
Table 2 Method comparison results on Suzuki-Miyaura HTE dataset
Table 3 Method comparison results on Buchwald-Hartwig ELN dataset

Tables 1, 2 and 3 show the performance comparison results on the Buchwald-Hartwig HTE dataset, Suzuki-Miyaura HTE dataset and Buchwald-Hartwig ELN dataset, respectively. For a fair comparison, we enlarge the training set of the other methods with the additional fine-tune samples to guarantee that our method uses the same quantity of training data as the other methods. Our method shows outstanding performance on the few-shot yield prediction task in all three datasets. For example, on the Buchwald-Hartwig HTE dataset, our method reaches \(R^{2}\)=0.7738, while the \(R^{2}\) of the random forest [10, 16], DRFP [42], RXNFP [43, 44], neural network [45], support vector machine [46], linear model [47] and GemNet [48] are 0.6538, 0.6470, 0.0032, 0.6179, 0.5322, 0.5928 and 0.5245, respectively. Among the other methods, the random forest has relatively good performance, which is consistent with the results of previous research [10, 11, 19]; thus the random forest is chosen as the baseline method in the following analysis.

Experiments show that our method performs well when the size of the training data is relatively small. As shown in Fig. 5, our model outperforms the baseline method on all three datasets as the size of the training data increases gradually. Our method achieves enhanced predictive power with markedly fewer training samples, which makes it an effective tool for the few-shot yield prediction problem. For example, when trained on only 2.5% of the Buchwald-Hartwig HTE data (only 90 reactions), MetaRF achieves results comparable to those of the baseline method using 20% of the same reaction data: using 2.5% of the data as the training set, our method reaches \(R^{2}\)=0.648, while the \(R^{2}\) of the baseline method is 0.571; when the training set increases to 20% of the data, the \(R^{2}\) of the baseline method is only 0.654. The comparison is similar on the Buchwald-Hartwig ELN and Suzuki-Miyaura HTE datasets. These results indicate that our method has great application potential in few-shot yield prediction. In our experiments, 80 training iterations are performed, and we use one gradient update with \(K = 40\) examples and learning rate \(\alpha = 0.0001\). More details about the splitting of the training set, validation set, and testing set are given in Section-MetaRF: Attention-based Random Forest.

Table 4 The results of the ablation study (on the Buchwald-Hartwig HTE dataset)
Table 5 The relative improvement of MetaRF compared to the baseline method (on the Buchwald-Hartwig HTE dataset)

Then we test the ability of our method to search for reactions with the highest yield. This ability is valuable because it helps chemists explore unseen chemical space and select high-yield reactions [49, 50]. We train our models with a relatively small training set (2.5% of the Buchwald-Hartwig HTE data, 5% of the Suzuki-Miyaura HTE data, and 50% of the Buchwald-Hartwig ELN data) and use them to predict the yields of the remaining reactions. The top 10 high-yield reactions are selected according to the prediction results, and we calculate the average and standard deviation of their observed yields. Figure 6 presents the average and standard deviation of the yields for the top 10 reactions predicted to have the highest yields in the three datasets. Besides our method and the baseline method, the results of ideal reaction selection and random reaction selection are presented. In all three datasets, our method has a higher average yield and lower standard deviation than baseline selection and random selection. For example, on the Buchwald-Hartwig HTE dataset, using MetaRF trained on 2.5% of the dataset, the predicted top 10 high-yield reactions from the remaining dataset have an average yield of 93.7±4.1%, compared to 95.5±1.8% for ideal selection. In contrast, baseline selection has an average yield of 86.3±5.4% and random selection has an average yield of 52.1±24.9%. The selection behaves similarly for the Buchwald-Hartwig ELN and Suzuki-Miyaura HTE datasets.
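The selection protocol itself is simple; a minimal sketch with placeholder yield arrays (not the actual dataset values) is:

```python
import numpy as np

# Predict the yields of the remaining reactions, pick the 10 with the highest predicted
# yield, then report the average and standard deviation of their observed yields.
rng = np.random.default_rng(0)
y_observed = rng.uniform(0, 100, size=500)                 # observed yields (placeholder)
y_predicted = y_observed + rng.normal(0, 10, size=500)     # model predictions (placeholder)

top10 = np.argsort(y_predicted)[-10:]                      # indices of the predicted top-10
print(y_observed[top10].mean(), y_observed[top10].std())
```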

Ablation study

To validate the effects of each component in MetaRF, we conduct an ablation study on the Buchwald-Hartwig HTE dataset, with 20% of the data as the training set. The number of fine-tune samples is five in the ablation study. For the baseline method (random forest), five fine-tune samples are randomly selected and then added to the training set.

The first ablation replaces the dimension-reduction based sampling with random sampling; the random sampling experiment is repeated 10 times, and the average performance is recorded. The second ablation removes the random forest structure, using MAML alone in place of the MetaRF framework. The third ablation keeps the random forest structure and replaces MAML with a standard pretraining and fine-tuning framework from transfer learning [20].

Table 4 presents the comparison of prediction performance in terms of \(R^{2}\) and RMSE. When dimension-reduction based sampling is replaced with random sampling, the \(R^{2}\) decreases from 0.7738 to 0.7003, demonstrating the effectiveness of the dimension-reduction based sampling method. The results of the ablation study also clearly demonstrate the importance of the random forest structure in MetaRF: removing the random forest causes the \(R^{2}\) to decrease from 0.7738 to 0.3730, which shows that the random forest mitigates the overfitting problem in few-shot prediction. Regarding the third ablation, the \(R^{2}\) decreases by 10% when MAML is replaced with transfer learning, and transfer learning provides only a minor improvement over the baseline.

Analysis on fine-tune sample number

Time-consuming chemical experiments raise the cost of acquiring new reaction yield data, so the few-shot setting and the specific number of fine-tune samples are very important for reducing the cost of empirical screening. We analyze the effect of adjusting the number of fine-tune samples on the Buchwald-Hartwig HTE dataset [10], using 20% of the data as the training set. The few-shot yield prediction ability is measured by the root mean square error (RMSE) and \(R^{2}\) performance.

When the number of fine-tune samples is 5, we obtain an 18.82% relative improvement in \(R^{2}\) and a 19.25% relative improvement in RMSE. Our method reaches \(R^{2}\)=0.7738 and RMSE=12.6401, while the \(R^{2}\) and RMSE of the baseline method (random forest) are 0.6538 and 15.6535, respectively. More evaluation results of the relative improvement are listed in Table 5. When the number of fine-tune samples varies, the relative improvement in RMSE remains around 20%, which demonstrates the stable and satisfactory performance of MetaRF on few-shot yield prediction.

Predicting performance on each additive

For the Buchwald-Hartwig HTE dataset, when using 20% of the data as the training set, the prediction performance for each additive in the testing set is shown in Fig. 7. In this experiment, the number of fine-tune samples is 5. For each additive, the predicted yield and observed yield are presented in a subplot. From Fig. 7, we can see that our model performs satisfactorily on new additives in the testing set, which shows that it can quickly adapt with only 5 data points.

Interpretability analysis

For interpretability analysis, we visualize the most important DFT (Density Functional Theory) descriptors of the models trained on different sizes of Buchwald-Hartwig HTE data in Fig. 8. One measure of feature importance is the decrease in the model’s \(R^{2}\) performance when the values of that feature are randomly shuffled and the model is retrained [10]. The feature importance results of models trained on different sizes of data differ only slightly. Generally, the most important descriptors are the aryl halide’s *C3 nuclear magnetic resonance (NMR) shift (the asterisk indicates a shared atom), the aryl halide’s vibration frequency, the additive’s *C3 NMR shift and the additive’s *C3, *O1 and *C4 electrostatic charges.
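A minimal sketch of this importance measure, assuming a scikit-learn random forest and user-supplied train/test arrays; it mirrors the shuffle-and-retrain procedure described above and is not the authors' exact evaluation code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def shuffled_feature_importance(X_train, y_train, X_test, y_test, feature_idx, seed=0):
    """Importance of one descriptor: drop in test R^2 after shuffling that column and retraining."""
    base = RandomForestRegressor(random_state=seed).fit(X_train, y_train)
    base_r2 = r2_score(y_test, base.predict(X_test))

    rng = np.random.default_rng(seed)
    X_shuffled = X_train.copy()
    X_shuffled[:, feature_idx] = rng.permutation(X_shuffled[:, feature_idx])
    shuffled = RandomForestRegressor(random_state=seed).fit(X_shuffled, y_train)
    shuffled_r2 = r2_score(y_test, shuffled.predict(X_test))

    return base_r2 - shuffled_r2   # larger drop means a more important descriptor
```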

Discussion

The advantage of our method is that it can quickly adapt to predict the yield performance of new reagents when a few additional samples are given. The underlying mechanism of this advantage is the adaptation of the DFT feature importance: when MetaRF is fine-tuned with a few additional samples, the importance of each decision tree changes accordingly. Different tree models represent different distributions of DFT feature importance, so our model can adjust the DFT feature importance according to a few reaction samples from the new reagent.

A limitation of our method is that it relies on historical data of the same reaction to train the model. According to the experimental results in Section-Performance Benchmarking, the training set should contain about 90 samples from the same reaction. In practical use, reaction yield data from the chemical literature may act as the training set. Considering the huge number of reaction types, a possible future direction is cross-reaction prediction, which uses data from one reaction to predict the yield of another type of chemical reaction.

Conclusions

This paper proposes an attention-based random forest model to solve the few-shot yield prediction problem. The workflow includes using DFT features to encode chemical reactions and using a meta-learning framework to decide the attention weights of the random forest. In the fine-tuning phase, we only need several samples to achieve satisfactory performance on new reagents: our method obtains about 20% lower RMSE when the number of fine-tune samples varies from 4 to 10. The effective few-shot prediction demonstrates that our method can predict the effect of a new reactant structure with little additional data. The methodology in this paper brings benefits to future work on few-shot yield prediction.