Sequential learning to accelerate discovery of alkali-activated binders

Alkali-activated binders (AAB) can provide a clean alternative to conventional cement in terms of CO2 emissions. However, as yet there are no sufficiently accurate material models to effectively predict the AAB properties, thus making optimal mix design highly costly and reducing the attractiveness of such binders. This work adopts sequential learning (SL) in high-dimensional material spaces (consisting of composition and processing data) to find AABs that exhibit desired properties. The SL approach combines machine learning models and feedback from real experiments. For this purpose, 131 data points were collected from different publications. The data sources are described in detail, and the differences between the binders are discussed. The sought-after target property is the compressive strength of the binders after 28 days. The success is benchmarked in terms of the number of experiments required to find materials with the desired strength. The influence of some constraints was systematically analyzed, e.g., the possibility to parallelize the experiments, the influence of the chosen algorithm and the size of the training data set. The results show the advantage of SL, i.e., the amount of data required can potentially be reduced by at least one order of magnitude compared to traditional machine learning models, while at the same time exploiting highly complex information. This brings applications in laboratory practice within reach.


Introduction
Concrete is the most widely used building material on earth. In 2019 alone, 30 billion tons of concrete were produced worldwide, which is almost four tons for every single person [1]. Since cement is the most important constituent of concrete and its production is associated with the emission of CO2, around eight percent of man-made CO2 emissions come from the production of cement [2]. The increasing demand for infrastructure development throughout the globe, related to a growing population and the resulting economic growth, leads to continued high consumption of cement-based building materials. If the climate agreements are adhered to, the question arises as to what extent ordinary Portland cement (OPC) is still viable as a mass-produced building material [3]. In this regard, the cement technology roadmap defined several ways to reduce the CO2 footprint at every step of cement production, and considerable effort has been made to develop more environmentally friendly binders [4]. However, exploring more CO2-friendly alternatives requires an extensive experimental program and, consequently, long development cycles until they can be used in practice [5]. Alkali-activated binders (AAB) and/or geopolymers (GP) omit the energy-intensive kiln process that is responsible for a large part of the CO2 emissions of conventional types of cement and are therefore considered environmentally friendly alternatives [6].
AAB/GP are synthesized by the reaction of an aluminosilicate precursor with an alkaline source [7]. Natural or artificial pozzolans, including granulated blast furnace slag, fly ash, metakaolin, or natural pozzolans, can be used as aluminosilicate sources [7]. However, the nature, chemical and mineralogical composition of the aluminosilicate precursor affect the properties of the resultant binder, such as fresh and hardened characteristics [8]. On the one hand, the wide variety of precursors available for the synthesis of AAB/GP makes them a suitable alternative binder for many applications. On the other hand, such large variations in the precursor composition make it difficult to draw general valid rules to synthesize these binders with the expected properties. Moreover, the type and concentration of alkaline solution impact the reactivity of the aluminosilicate precursors used for AAB/GP [8]. Several publications have shown the potential of granulated blast furnace slag and fly ash as a precursor for AAB/GP [5]. However, the environmental protection laws in many countries and European environmental goals aim to reduce and eventually omit the burning of coal and encourage renewable energies [9]. The German environmental goals target the closing of coal power plants by the year 2038 [10], which means the available amount of fly ash from local sources in Germany will decrease in some years. Similarly, the iron industry aims at using alternative renewable fuel sources for its manufacturing process, which means that the composition and quality and probably also the amount of blast furnace slag will be affected [11]. Moreover, for more than a decade, fly ash, bottom ash and boiler slags are fully utilized in European countries in either producing blended cements, direct addition to concrete or other applications [5,12]. 
Hence, the decreasing availability of common pozzolans and their full utilization by industry drive the search for new alternative precursors, including natural (such as volcanic ashes) and artificial resources (such as urban and agro-industrial wastes) [13]. The increasing demand for alternative aluminosilicate sources and the complexity of the synthesis of AAB/GP binders make it challenging to achieve fast progress in this field.

Objective, scope and novelty of the research
In order to achieve the desired breakthrough in the shortest possible time, optimization of the materials research process is urgently needed. Sequential Learning (SL) and the closely related Bayesian optimization have repeatedly been reported to have great potential in accelerating drug and material discovery [14,15]. The basic idea is to reduce the number of unsuccessful experiments (i.e., those that lead to materials with unwanted properties) so that an ideal sequence of successive experiments is achieved. This is accomplished by coupling a prediction model (e.g., a machine learning model) with a decision-making rule based on a so-called utility function that guides the experimental program. Work that argued for the fundamental predictability of the properties of cementitious materials in general and AABs in particular with data models has been published (see Table 1). However, to the best of the authors' knowledge, there is no work available that investigates the transferability of SL methods to AAB research. Figure 1 illustrates the conceptual differences between classic Machine Learning (ML) (Fig. 1, left) and the novel SL approach (Fig. 1, right). The figure shows a mathematical space spanned by two base materials X1 and X2 (in a real scenario, additional features may be considered). Depending on the respective mixing ratio, the synthesis of X1 and X2 results in materials whose properties are shown in color (turquoise areas for low-performance and pink areas for high-performance materials). ML and SL model this relationship using sampled data (black and white dots in Fig. 1, respectively). These models can be employed to predict ideal combinations of X1 and X2. Classical ML approaches accomplish this (mostly) through interpolation, whereas SL is inherently extrapolative and thus potentially requires far fewer sample data.
However, serial data collection of SL, even if successful, may be disadvantageous, because waiting for experimental results could delay experimental progress. This is especially the case for materials whose synthesis is complex and whose material properties require time to develop or characterize (e.g., 28-day compressive strength of binders). Collecting all samples at once or in batches can be more successful. SL could therefore fill a gap in material innovation where data are not available or large-scale data collection would be too expensive.
The scope of this paper is to investigate exactly under which conditions SL can contribute to accelerating research regarding the properties of alkali-activated binders. For this, we have compiled an AAB data set with 131 samples from several publications. Based on these data, we investigate how the SL approach, the complexity of the task, the quantitative research objective and the desired success rate influence the performance of SL and deduce some of the critical circumstances under which SL can potentially enhance AAB research practice. We introduce a novel utility function that adapts common utility functions for applications with minimal training data, i.e., lower number of experiments to reach the optimal sample design.

Structure of the research
The presented paper is organized as follows. First, we survey the literature on ML for concrete property prediction in terms of its potential and applicability for alkali-activated binder research. We combine this with an overview of current AAB research to highlight the practical potential associated with an optimized search by SL. We then briefly describe the SL algorithms that were used in this work. We use Gaussian process regression (GPR), decision tree regression (DT) and tree ensemble regression (TE) together with four different utility functions: maximum expected improvement (MEI), maximum likelihood of improvement (MLI) and a novel combination of each of MEI and MLI with a maximized distance measure. We go on to describe how we benchmark the acceleration in the experimental program. In the experimental section, we explain how data were acquired from the literature and describe our investigations with SL. The result section shows which constellations lead to a successful acceleration of the experimental progress. We conclude by discussing how scientists can integrate SL into their workflow to fully exploit its potential.

Literature review
Because of various properties and many possible combinations of the constituents of building materials, early on, a computer-aided approach was considered [16]. Beyond the mere streamlining of the concrete mix design, capabilities have expanded significantly by introducing ML approaches. Models that interpolate the properties of new mixtures are generated from data of existing compositions where the properties have been determined in the laboratory. Naderpour et al. [17] summarized some of the earlier models. Chaabene et al. [18] review the available literature on ML for the prediction of mechanical properties of concrete. Table 1 focuses on recent research in this field by summarizing models from the past five years (2017-2021).
ML-based approaches exist for a wide range of cementitious materials, from OPC-based binders to AAB and recycled aggregate concrete, to name a few. In most cases, the models are used to predict compressive strength, one of the most important properties of concrete. With an average of 655 samples, the ML models require large amounts of data collected by laboratory standards. Remarkably, only 6.67 features (factors affecting the properties) are used on average to represent the composition of the material. The level of detail of the data considered is thus low relative to the complexity of alternative binders, as the properties of AAB depend on several factors, and the wide-ranging characteristics of aluminosilicate precursors and alkaline activators directly impact the properties of the resultant AAB [8].
Based on the analysis of 111 distributed data sources, Xie et al. [46] state that much more detailed data are needed for a general understanding that goes beyond the limits of a single available AAB. This is a common challenge in materials science, where millions of possible compounds span a high-dimensional discovery or search space, of which only a tiny fraction has been experimentally explored. The task of material discovery is to find desired properties in this space. The challenge is to make the best use of the limited knowledge from the few data points available. Currently, applied ML-based cement models are infeasible for this task because they cover only a few dimensions of the search space and require large amounts of data.
Despite success in other experimental sciences such as drug discovery, relatively few publications on SL exist in materials science. Lookman et al. [14] give an overview of the SL landscape in materials science. Although SL regularly starts with a machine learning model based on only a few very high-dimensional data points and requires relatively little additional data from the laboratory, it often outperforms baseline benchmarks. However, these benchmarks are mostly established statistically and do not necessarily mean that SL will also accelerate research in practice. In practice, experimental designs often determine the speed, and SL would possibly entail an adjustment of the entire process. Yet even in simple statistically motivated scenarios, the specific performance is usually highly dependent on the data and the problem, and the exact relationships are still largely unknown [47]. Lookman et al. conclude that despite many studies that use machine learning to make predictions, feedback between ML and experiments via SL has only recently been investigated in materials science.
The following section surveys what information is required to be able to make predictions of AAB properties. The available literature about AAB/GP shows that the properties of these binders, such as compressive strength, depend on a vast variety of factors, far beyond those considered in Table 1. These factors may include the chemical and mineralogical composition of the precursor, the type, concentration and SiO2/M2O ratio (M = alkali cation) of the alkaline solution used, curing conditions, specific surface area, water-to-binder ratio, degree of silicate polymerization, age of the sample and others [8,48-50]. The aluminosilicate source materials for alkali-activation show a wide range of oxide compositions and mineralogical phases. The chemistry of the reactive phase mostly defines the reactivity of the raw material. However, the composition of the alkaline solution has an enormous impact on the strength-developing phases [8,50]. An optimum range of silica modulus is necessary to achieve higher compressive strength for natural pozzolan-based geopolymers when the other influencing factors are kept the same [8]. This optimum range of silica modulus, in turn, differs for every type of raw material. For various precursors, different alkaline solutions in different concentrations have been recommended to achieve the desired properties [49]. Similarly, the specific surface area, measured either in terms of Blaine fineness or particle size distribution (d90, d50, d10), affects the degree of reaction and eventually the compressive strength [7]. The amount of water present in these binders has a complex role and acts as a reaction medium. The extent of silicate polymerization of the silicate solution influences its reactivity. Consequently, the number of experiments to be conducted to reach the optimum compressive strength can be extremely high and the certainty of reaching the optimum is low.
Given this complex material behavior and the amount of experimentation conventionally required, advanced tools like SL have enormous potential to accelerate the material discovery process.

Setting up sequential learning for materials discovery
The underlying idea of SL is that not all experiments are equally useful. Some experiments provide more information than others. In contrast with classical design of experiments, where (only) the experimental parameters are optimized, the potential outcomes of the experiments themselves are the decisive factor. The most promising experiments are preferred over dead-end experiments and experiments whose outcome is already known. Experimental results are used to iteratively improve the ML model with high-quality data. Each new experiment is selected to maximize the amount of useful information, e.g., according to [51], using previous experiments as a guide for the next experiment. Figure 2 provides a chart that depicts the workflow of SL. The prediction of material properties in SL is based on a list of candidate materials given by experts using their domain knowledge, see (1) in Fig. 2. Materials may be of interest because they are available, cheap, known to have further desirable properties, or simply because they seem generally promising. Although the exact criteria are not specified, it is recognized that the performance of SL for material discovery is related to the quality of the candidates [47]. The candidate materials are represented in the so-called design space (DS), a vector space that is comparable to the feature space in classical ML approaches. In the DS, the coordinates of each material are parameterized information about raw material, (micro-)structure and processing. An initial training data set with known target properties serves as an input for the prediction model in the first round (see (2) in Fig. 2). At the core of the iterative SL task is the prediction of experimental outcomes ((3) in Fig. 2), weighting the expected utility and deciding which candidate to investigate next ((4) in Fig. 2).
The utility is commonly estimated based on the predicted material property (the closer a predicted experimental result is to the desired value, the more useful it is) and a measure of uncertainty. The latter is a key driver for discovering new relationships and the basis of experimenting in general. As Reyes et al. [15] aptly summarize it, "Actually, the outcome of an experiment is the deviation from what we expected." In other words, if the outcome of an experiment is already known, there is no reason to conduct it, and an experiment can be more useful if the uncertainty of its outcome is large. Lookman et al. [14] even state that they are not aware of an SL study where a new material has been discovered without utilizing uncertainties. In this sense, uncertainty can be considered an essential factor in the decision-making process. The SL task is finished as soon as the desired property is obtained ((5) and (6) in Fig. 2).
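The workflow described above can be sketched as a simulated loop over a fully labeled data set. The following is a minimal illustration under stated assumptions, not the implementation used in this work: the k-nearest-neighbour predictor `knn_predict` is only a stand-in for the prediction model, the function names, the toy data and the choice to draw the initial training set from below-threshold candidates are hypothetical, and the step numbers in the comments refer loosely to Fig. 2.

```python
import math
import random
import statistics


def knn_predict(train_x, train_y, x, k=3):
    """Stand-in model (assumption): mean and spread of the k nearest known labels."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], x))[:k]
    values = [y for _, y in nearest]
    return statistics.fmean(values), statistics.pstdev(values)


def simulated_sl_run(candidates, labels, utility, threshold, n_init=4, seed=0):
    """Count the simulated experiments needed until a label >= threshold is found."""
    rng = random.Random(seed)
    # (2) initial training set, drawn only from candidates below the target
    below = [i for i, y in enumerate(labels) if y < threshold]
    known = rng.sample(below, n_init)
    # (1) remaining candidates of the design space
    pool = [i for i in range(len(candidates)) if i not in known]
    draws = 0
    while pool:
        train_x = [candidates[i] for i in known]
        train_y = [labels[i] for i in known]
        # (3) predict mean and uncertainty for every open candidate
        preds = [knn_predict(train_x, train_y, candidates[i]) for i in pool]
        # (4) choose the candidate with the maximum utility
        pick = max(range(len(pool)), key=lambda j: utility(*preds[j]))
        chosen = pool.pop(pick)
        # (5) "run" the experiment by revealing the known label
        known.append(chosen)
        draws += 1
        # (6) stop as soon as the target is reached
        if labels[chosen] >= threshold:
            return draws
    return draws


# Toy one-dimensional design space (hypothetical data, for illustration only)
candidates = [(i / 19,) for i in range(20)]
labels = [10.0 * c[0] for c in candidates]
draws_needed = simulated_sl_run(candidates, labels, lambda mean, std: mean, max(labels))
```

Repeating such a run with different seeds gives a distribution of required draws, from which a mean or a quantile can be read off.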
The following section lays out the prediction methods, uncertainty estimates and selection strategies utilized in this work.

Prediction methods and uncertainty estimates
We compared common regression methods against Gaussian Process Regression (GPR), which has its origin in adaptive sampling and is still one of the reference methods in SL or Bayesian optimization [14]. The regression methods investigated include shallow neural network regression (NNR), linear regression (LR), decision tree regression (DT) and bagged decision tree ensembles (TE). Our preliminary analysis showed that LR and NNR perform worse compared to others and that NNR additionally required much higher computational capacity. The drop in performance is probably due to the higher proneness of these methods to co-linearity in sparse DS. This effect could be minimized by regularization but would require additional hyperparameter tuning, further increasing the computational effort. Based on these observations, we concentrate on the comparison of GPR, DT and TE in this paper. These methods are briefly described as follows.
Originally, decision trees and tree ensembles are classification algorithms that learn the segmentation of an input data space, e.g., the DS, from pairs of data and labels [52]. By introducing one class per discrete label value (and interpolated intermediate values), pseudo-regression is performed, meaning that interpolated predictions are possible, but extrapolations outside the range of values of the label set are not. The core of tree-based algorithms is the sequential decision-making alongside the values of the respective input variables. In that sense, the data points are not considered as a "whole", but each coordinate is independently partitioned into discrete label values. By nature, this makes the method relatively ill-suited to capture inter-parameter correlations. However, this can be advantageous for high-dimensional data, where unwanted correlations (so-called co-linear behavior) often result from the limited amount of available data (as expected with AAB data). In addition, the set of successive decisions is limited. This can result in a prioritization of relevant DS parameters, which helps to further reduce problem complexity.
Ensemble trees resample the training data (e.g., by a random draw with replacement) and train a new decision tree on each of the draws. The resulting ensemble tree is, depending on the respective algorithm, the average of the tree ensemble (so-called bagging) but can also have a more complex algorithmic nature that includes, for example, an error-weighted average of the trees (as in boosting) [53]. Ensemble learners generally reduce the influence of noisy training data on the prediction and create more refined decision rules. However, resampling requires slightly more data, which could have a negative effect for very small data sets (as is to be expected in an early experimental stage). In this work, the off-the-shelf MATLAB functions "fitrtree" [54] and "fitrensemble" [55] were used with ten surrogate splits and ten ensemble trees and otherwise standard settings.
A crucial parameter of many SL methods is the uncertainty of a prediction (see section SL). More precisely, the epistemic uncertainty from the potentially erroneous assumptions of a model due to incomplete information is sought. Most ML methods do not provide an estimate of this by default because they are point estimates. However, it can be calculated as the dispersion of the prediction under slightly varying boundary conditions. To this end, varying training datasets can be created by resampling (such as jackknife bootstrapping) from the original training set as in [56]. The uncertainty then corresponds to the prediction scattering of the models trained on different samples of the training data.
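The resampling idea can be illustrated with a short sketch. It assumes a plain bootstrap (draws with replacement) rather than the exact resampling scheme of [56], and it uses a one-split regression stump on one-dimensional inputs as a stand-in for a full decision tree; `fit_stump` and `bootstrap_prediction` are hypothetical helper names.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a one-split regression stump on 1-D inputs; returns a predict function."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no split possible between identical x values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        m_left, m_right = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - m_left) ** 2 for y in left) + sum((y - m_right) ** 2 for y in right)
        split = (pairs[k - 1][0] + pairs[k][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, split, m_left, m_right)
    if best is None:  # all x identical: fall back to a constant model
        mean_y = statistics.fmean(ys)
        return lambda x: mean_y
    _, split, m_left, m_right = best
    return lambda x: m_left if x <= split else m_right


def bootstrap_prediction(xs, ys, x_new, n_models=50, seed=0):
    """Mean prediction and its scatter over models trained on bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        model = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(model(x_new))
    return statistics.fmean(preds), statistics.pstdev(preds)


# Hypothetical toy data with a jump between x = 3 and x = 4
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [10.0, 11.0, 9.0, 10.0, 30.0, 31.0, 29.0, 30.0]
mean_pred, std_pred = bootstrap_prediction(xs, ys, 3.5)
```

The returned standard deviation plays the role of the uncertainty estimate: it is zero when all resampled models agree and grows where the prediction depends strongly on which samples happen to be drawn.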
Gaussian Process Regression was introduced by Krige in 1951 [57] and is a probabilistic model. The core concept of GPR is to assume an underlying distribution, i.e., to treat the data as random variables. Unlike DT and TE, which learn exact values for each parameter in a function, GPR derives a probability distribution using Bayes' rule. It updates prior knowledge (in the case of GPR, a specified distribution function) with observations (the training data) to compute a joint posterior probability distribution over all possible values. This posterior contains information from both the prior distribution and the data set. Predictions are made through the joint distribution by weighting all possible predictions with their calculated posterior distribution. The output is the point estimate at the considered point of interest, which yields an expected value and a variance. As the latter can be naturally used as a measure of uncertainty, GPR has been a popular method for sequential learning or the closely related Bayesian optimization applications [14]. The probability functions in GPR are commonly specified by a multivariate Gaussian distribution (in theory, other distributions can be used, too), a so-called Gaussian process (GP), which is defined by a mean and a covariance function. The selection of GPs can incorporate a priori knowledge about boundary conditions, e.g., when periodicity, dependencies, various length scales or general trends are known. However, this is mainly relevant for time series and location-dependent data and has no immediate applicability for the presented case. Furthermore, GPs control the smoothness of the (interpolated) predictions. We compared all GPs that are implemented in MATLAB's statistics toolbox [58] and found that the exponential GP performed best in the SL task.
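To make the posterior mean and variance concrete, the following sketch computes both for a single query point. It is an illustration under assumptions rather than the toolbox implementation: the covariance function is taken to be the common exponential form sigma_f^2 * exp(-r / length), the prior mean is assumed constant (the mean of the training labels), the hyperparameters are fixed by hand instead of being fitted, and `gpr_predict` and `solve` are hypothetical helper names.

```python
import math


def exp_kernel(a, b, length=1.0, sigma_f=1.0):
    """Exponential covariance between two points (assumed kernel form)."""
    return sigma_f ** 2 * math.exp(-math.dist(a, b) / length)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gpr_predict(train_x, train_y, x_new, length=1.0, sigma_f=1.0, noise=1e-6):
    """Posterior mean and standard deviation at x_new."""
    n = len(train_x)
    mean_y = sum(train_y) / n  # constant prior mean (assumption)
    K = [[exp_kernel(train_x[i], train_x[j], length, sigma_f)
          + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    k_star = [exp_kernel(x, x_new, length, sigma_f) for x in train_x]
    alpha = solve(K, [y - mean_y for y in train_y])   # K^-1 (y - m)
    v = solve(K, k_star)                              # K^-1 k*
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    var = sigma_f ** 2 - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))


# Hypothetical toy data
train_x = [(0.0,), (1.0,), (2.0,)]
train_y = [1.0, 3.0, 2.0]
near_mean, near_std = gpr_predict(train_x, train_y, (1.0,))
far_mean, far_std = gpr_predict(train_x, train_y, (50.0,))
```

At a known training point the standard deviation is close to zero, while far from all training data the prediction falls back to the prior mean with the full prior standard deviation. This is the behaviour that makes the GPR variance usable as an uncertainty measure.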

Strategies and utility functions
SL executes a strategy to select the next input by prioritizing the predictions, which are weighted by a utility function. Depending on whether the objective is to minimize or maximize a criterion, the prioritization is conducted by choosing the minimum or maximum weighted value. For simplicity, only the maximization case will be considered in the remainder of the manuscript, which can be described by Eq. (1):

$$x_{n+1} = \arg\max(u) \tag{1}$$

where $x_{n+1}$ is the selected next candidate and $\arg\max(u)$ corresponds to finding the maximum utility $u$. Three general strategies can be distinguished. 1. Explorative strategies attempt to reduce model uncertainty by using utility functions that favor candidates with large prediction uncertainties. 2. Exploitative strategies tend to reinforce the current model perception by considering only the predicted values in the utility function (without considering uncertainties). 3. The third group balances between exploring and exploiting. Only 2. and 3. are greedy strategies and thus suitable for most material finding problems.

Maximum expected improvement (MEI)
The MEI strategy [56] purely exploits by simply selecting the next candidate according to the maximum prediction value. The utility $u_i$ of the $i$-th prediction is simply:

$$u_i = \mu_i \tag{2}$$

where $\mu_i$ is the mean prediction of the $i$-th candidate.

Maximum likelihood of improvement (MLI)
The MLI strategy [56] is an explore-and-exploit strategy. It selects the candidate with the highest likelihood of exhibiting the desired target property. In the case of normally distributed predictions, the candidate with the highest 95 percent likelihood can be determined according to Eq. (3), where $Q(95\%)$ is the 95% quantile, $\mu_i$ is the mean prediction of the $i$-th candidate and $\sigma_i$ is the standard deviation of the $i$-th candidate.
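The difference between the two strategies can be shown on a few hypothetical predictions. The sketch below assumes that the MLI utility is the one-sided 95% quantile of a normal distribution, i.e., the mean plus roughly 1.645 standard deviations; whether Eq. (3) takes exactly this form is an assumption, and the function names and numbers are illustrative only.

```python
from statistics import NormalDist


def mei_utility(mean, std):
    """Exploit only: the utility is the predicted mean."""
    return mean


def mli_utility(mean, std, quantile=0.95):
    """Explore and exploit: upper quantile of a normal prediction (assumed form)."""
    z = NormalDist().inv_cdf(quantile)
    return mean + z * std


def select_next(predictions, utility):
    """Index of the candidate with the maximum utility."""
    return max(range(len(predictions)), key=lambda i: utility(*predictions[i]))


# Hypothetical (mean, standard deviation) predictions of compressive strength in MPa
predictions = [(40.0, 1.0), (38.0, 5.0), (35.0, 2.0)]
mei_choice = select_next(predictions, mei_utility)
mli_choice = select_next(predictions, mli_utility)
```

With these numbers, MEI picks the first candidate because it has the highest predicted mean, whereas MLI prefers the second candidate, whose slightly lower mean is outweighed by its larger uncertainty.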

MEI and MLI with maximum Euclidean distance (MEI + D and MLI + D, respectively)
At the beginning of an SL run, the predictive power of ML algorithms is relatively poor due to the small amount of training data. The data are further reduced by sampling for uncertainty estimation by DT and TE, with only a portion of the data available for each sample. This causes a situation where many candidates yield the same prediction and uncertainty value, despite the fact that their composition and processing are very different. Candidates that have a large average Euclidean distance to the known DS candidates naturally differ more in their design. Choosing them increases the data variability and, in turn, improves the predictive model's performance the most. This a priori knowledge is naturally part of GPR, such that it outputs higher uncertainties for more distant data points. The utility function can be adjusted in a similar way by choosing, from a given range of prediction values, the candidate that has the largest mean distance to the known DS candidates. The MEI + D and MLI + D utilities were estimated according to Eqs. (4) and (5), respectively.
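$$u_{\mathrm{MEI+D},i} = \mathrm{meandist} = \frac{1}{n}\sum_{j=1}^{n}\left\lVert x_j - x_{Q(u_{\mathrm{MEI}},0.9),i}\right\rVert \tag{4}$$

$$u_{\mathrm{MLI+D},i} = \mathrm{meandist} = \frac{1}{n}\sum_{j=1}^{n}\left\lVert x_j - x_{Q(u_{\mathrm{MLI}},0.9),i}\right\rVert \tag{5}$$

where meandist is the mean Euclidean distance, $x_j$ are the coordinates of the $j$-th sample of the known training data with $n$ samples, and $x_{Q(u_{\mathrm{MEI}},0.9),i}$ and $x_{Q(u_{\mathrm{MLI}},0.9),i}$ are the DS coordinates of the $i$-th candidate whose MEI or MLI utility is greater than the 90% quantile. The MEI + D and MLI + D strategies aim at boosting the initial rounds of an SL run and hence were restricted to the first 15 iterations in the presented work. After that, the utility was calculated according to MEI and MLI.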
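The distance-based selection can be sketched in a few lines. This is an illustration under assumptions: the 90% quantile is approximated by a simple rank-based cutoff, candidates with a utility equal to the cutoff are kept so that tied candidates remain on the shortlist, and `select_with_distance` is a hypothetical name.

```python
import math


def mean_distance(x, known):
    """Mean Euclidean distance from x to all known training points."""
    return sum(math.dist(x, k) for k in known) / len(known)


def select_with_distance(candidates, utilities, known, q=0.9):
    """Among candidates in the upper utility quantile, pick the most distant one."""
    ranked = sorted(utilities)
    # simple rank-based quantile estimate (assumption)
    cutoff = ranked[min(int(q * len(ranked)), len(ranked) - 1)]
    shortlist = [i for i, u in enumerate(utilities) if u >= cutoff]
    return max(shortlist, key=lambda i: mean_distance(candidates[i], known))


# Hypothetical example: three candidates share the same top utility
candidates = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.5)]
utilities = [10.0, 10.0, 10.0, 2.0]
known = [(0.0, 0.0), (0.2, 0.1)]
choice = select_with_distance(candidates, utilities, known)
```

Here the three tied candidates are indistinguishable by utility alone, so the one farthest from the known data is selected, which adds the most variability to the training set.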

Benchmarking SL against a Random Process (RP)
Although SL is based on ML methods, classical error-based ML benchmarks typically do not apply in this context. This is because the target of SL in materials discovery is to find a candidate with the maximum or minimum value of a given property, depending on the property. In a reasonably set scenario, this goal is always achieved with zero error and is merely a matter of iterations. Although this comparison is somewhat odd from a mathematical point of view, it underscores the fact that the focus here is on the effort required to reach this threshold as a measure of performance. A common metric is the required number of experiments between (5) and (3) in Fig. 2 until a set target is reached ((6) in Fig. 2).
To determine the performance of SL methods, it is common to use simulated experiments where the ground truth labels for all data points are already known [14]. Initially, only a small fraction is provided to the SL algorithm (although more training data would be available). This is extended with one new data point from the remainder of the available data at each iteration. It is investigated which approach requires the least amount of data to achieve the goal ((6) in Fig. 2). Approaches that require less data simply lead to faster success in laboratory practice.
Thus, the goal is not to actually discover new materials using all available data, but to validate material discovery methods for scenarios where fewer labels are known (e.g., for new materials). In this approach, the generalizability is statistically demonstrated by quantifying the performance of SL methods under randomized initial conditions and then expressing it, for example, as a mean value and standard deviation. This allows meaningful comparisons between different SL approaches. To generate randomized initial conditions in an in-progress experimental study would require significant additional effort and is unrealistic in most cases. Therefore, comparisons of performance and repeatability between different SL methods in actual material discovery are usually not possible.
This approach also differs from the classical ML approach, where generalizability is demonstrated on retained test data. However, this luxury is often not afforded in experimental science, where data are extremely limited due to costly acquisition [14].
SL is commonly compared against a Random Process (RP) (i.e., acting without a strategy and model) as a baseline benchmark. RPs consider each candidate equally likely to succeed (uniform distribution). On average, the number of draws necessary to find the candidate with the maximum target property is 50% of the given candidates. This is the benchmark against which SL competes.
Despite the fact that this benchmark is often surpassed by SL, a significant use of SL cannot be found in practice. One reason for this may be the significantly higher effort that is caused by the sequentialization of the experimental procedure in SL. This means that from a purely functional point of view, RP can produce the desired results faster if the parallelization of experiments is more effective. In view of this situation, it is worthwhile to include further parameters for the consideration of the usefulness of SL in practice.
The specific value of the target threshold T (i.e., the property value to be exceeded) inherently affects the number of iterations required; the smaller the T, the fewer iterations are required for SL to succeed. From a practical perspective, the relatively small deviations among the highest cement strengths argue against assigning special significance to a unique strength value as the target (especially considering the aleatory uncertainties of this value). To accelerate experimental progress, one can argue for reducing T to a value that lies in the upper quantile of strengths (e.g., $T \to f_{c,90\%}$) without losing much significance of the results. Furthermore, the aspired success rate determines the number of experiments required. The relationship is simple: the higher the desired success rate, the more experiments are needed. The performance of SL at a certain success rate can be empirically determined as the quantile of the required draws from multiple SL runs. In laboratory practice, the required success rate is expected to be much higher than the 50% rate, which is, as mentioned above, the typical benchmark for SL.
The relationship between success rate and target threshold can be described analytically for RP as the hypergeometric cumulative distribution according to Eq. (6), where p corresponds to the success rate, N is the size of the population, K is the number of items with the desired characteristic in the population and n is the number of samples drawn. The threshold of success T can be defined in terms of the parameters M and x. According to Eq. (6), the success rate p has a nonlinear relationship with the required draws for the case of multiple targets (where M > 1), i.e., the aforementioned rule that a 50% success rate requires 50% of the possible experiments does not hold in those cases. Instead, far fewer data are required. The exact amount further depends on the size of the population x, where a greater x leads to fewer required draws n. This means, firstly, that RP becomes a much tougher benchmark when T can be reduced to include more successful candidates. Secondly, a fair comparison against RP must consider the maximum available size of the population x. For example, if the DS is fragmented into several smaller DSs to parallelize SL, the RP performance must still be considered on the whole DS, as its parallelization is independent of the segmentation. In other words, benchmarks can only be compared among DSs of the same size, and the perceived performance is skewed in favor of SL when smaller DSs are used.
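The RP baseline can be evaluated numerically with a short sketch. It assumes that "success" means drawing at least one of the K target candidates in n draws without replacement from N candidates, which follows from the hypergeometric distribution; the function names are hypothetical, and the value K = 13 is only an illustrative choice of roughly 10% of a population of 131.

```python
from math import comb


def rp_success_rate(N, K, n):
    """P(at least one of K targets among n draws without replacement from N)."""
    return 1.0 - comb(N - K, n) / comb(N, n)


def rp_required_draws(N, K, p):
    """Smallest number of random draws that reaches success rate p."""
    for n in range(N + 1):
        if rp_success_rate(N, K, n) >= p:
            return n
    return N


# Single target: a 50% success rate needs about half of the candidates
single_target = rp_required_draws(131, 1, 0.5)
# Relaxed threshold with 13 acceptable candidates (illustrative value)
relaxed_50 = rp_required_draws(131, 13, 0.5)
relaxed_90 = rp_required_draws(131, 13, 0.9)
```

For a single target the success rate grows linearly with the number of draws, so about half of the population is needed for a 50% success rate. With several acceptable targets the required number of draws drops sharply, and demanding a 90% success rate raises it again, which is why both parameters have to be fixed before SL is compared against RP.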

Experiments
In this section, the data collection and the numerical experiments are described.

Description of the data
For this study, three data sets from five different studies were collected to support the discovery of materials with higher compressive strength. They comprise data from [8, 59-62] on alkali-activated binders prepared from four different natural pozzolans originating from Germany and Italy, from pumice stone-based natural pozzolans and from granulated blast furnace slags.

First data set
The first data set, taken from [8,59,62], covers four different natural pozzolans originating from Germany and Italy that were used to prepare geopolymers/alkali-activated binders with sodium silicate solution. These pozzolans are Rhenish trass (RT) from the Eifel region, Germany; Bavarian trass (BT) from the Nördlinger Ries, Germany; pozzolan Laziale black (PB) from the Ponte Lucano quarry, Tivoli, Italy; and pozzolan Flegrea (PF) from the Campi Flegrei area, Naples, Italy. In a study by Firdous and Stephan [8,59], these four pozzolans were reacted with sodium silicate solutions of various silica moduli in the range of 0.4-1.7 (SiO 2 /Na 2 O molar ratio) at ambient conditions (22°C, 100% RH) to see the impact of the silica modulus of the alkaline solution on the properties of the resultant binder. The conditions at the time of formation of a pozzolan, such as the composition of the magma, and the conditions after formation affect its chemical and mineralogical composition. Therefore, all the pozzolans had different chemical and mineralogical compositions, as determined with XRF and XRD, respectively. Fineness can be measured in several ways; because the Blaine method is the quickest and most commonly used, the authors measured the Blaine fineness following EN 196-6 [63] for better comparability and kept it in a close range (6700 ± 160 cm 2 /g) for the pozzolans used, whereas the d 50 particle size, measured by laser granulometry (Mastersizer 2000, Malvern Instruments), varied. To achieve the various silica moduli, different combinations of sodium silicate solution (SiO 2 = 30.2 wt.%, Na 2 O = 14.7 wt.%) and sodium hydroxide solution (3.6, 6.6, 9.2 and 11.5 mol/L) were used. Therefore, the resultant SiO 2 mol.%, Na 2 O mol.%, H 2 O mol.% and SiO 2 /Na 2 O (mol/mol) are considered here for the analysis, as a change in any of them can affect the compressive strength. The alkaline solution to pozzolan ratio also impacts the compressive strength; therefore, for each pozzolan, the amount of solution required to obtain a sample of good workability was determined and kept equal to 0.50, 0.75, 0.43 and 0.52 for RT, BT, PB and PF, respectively. All the paste samples prepared for compressive strength testing were mixed in a similar manner and cured at 22 ± 2°C and 100% relative humidity (RH). Cubes with an edge length of 2 cm were used to determine the compressive strength at the same loading rate for all the samples.
A further extension of the data presented above was taken from [62], where the pozzolan Laziale black and the pozzolan Flegrea were additionally reacted with sodium and potassium hydroxide solutions (9.2 and 11.5 mol/L). The prepared AAB were heat-cured at 40°C and 100% RH for the first 28 days, followed by curing at ambient temperature (22°C, 100% RH) until 90 d. The heat-cured samples were further compared with samples cured only at ambient conditions. To see the impact of different mineralogical compositions and fineness, the authors used a low-reactivity pozzolan (Bavarian trass) for further analysis [59]. In this study, the Bavarian trass was subjected to calcination and mechanical activation by burning the pozzolan at 700°C for 3 h and by milling it in a planetary ball mill for 5 and 10 min at a speed of 200 rpm with a constant media to material ratio of 1:0.16. The calcination resulted in a reduction in the calcium carbonate content of the pozzolan and the formation of CaO, whereas the mechanical activation resulted in a 23% and 43% reduction in d 50 in comparison with the un-milled Bavarian trass for the material milled for 5 and 10 min, respectively. For the preparation of the AAB/GP samples, sodium silicate solutions of silica modulus 0.707, 0.797 and 1.061 were used, whereas the mixing method, rate of loading, sample size and curing conditions were kept similar to those in the above study.

Second data set
The second data set is taken from [60], where the authors used a pumice-type natural pozzolan obtained from the Taftan mountains, Iran, for the production of alkali-activated binders. Mineralogically and chemically, this pozzolan was different from those in the first data set. The chemical composition measured following ASTM C311 [64] showed a high siliceous content. The Blaine fineness of the pozzolan used was 3090 cm 2 /g, which is lower than the fineness of the pozzolans used in the first data set. Sodium silicate solutions of different silica moduli in the range of 0.3-0.9 were used, and the Na 2 O/Al 2 O 3 molar ratio of the system was also varied in the range of 0.77-1.23. To achieve the desired silica modulus, a combination of sodium hydroxide and sodium silicate solution was used. For the determination of compressive strength, 2 cm 3 samples were cured at 25°C and 95% RH.

Third data set
The third data set was published by Tänzer [61]. In this work, artificial glasses were prepared by blending and remelting mixes of one GGBFS and several oxides. The individual melting experiments were performed between 1550°C and 1650°C in a nitrogen atmosphere, and the material was granulated at 3 bar in water at 10°C. The granulate was ground in a laboratory ball mill. Laser granulometry with a Horiba LA-300 revealed d 50 values from 8.5 to 16 µm, and the Blaine fineness according to DIN 66126 [65] and EN 196-6 [66] was found to be 4120-4800 cm 2 /g. The chemical constituents of the synthesized slags were determined following EN ISO 11885 [67] based on ICP-OES in the case of Al 2 O 3 , CaO, Fe, K 2 O, MgO, MnO, Na 2 O, S and TiO 2 . Furthermore, SiO 2 was determined gravimetrically according to EN 196-2 [68]. Due to the non-oxidizing conditions during glass manufacturing, the oxidation state of Fe and S is unknown. For the current evaluation, it was assumed that both elements are fully oxidized.
Sodium hydroxide (2 mol/kg), sodium waterglass (Ms = 1) and two potassium waterglass (Ms = 1 and 2) solutions were used as activators. The solution-powder ratio was between 0.374 and 0.432. Pastes were prepared with a hand mixer in 30 s. The paste-filled molds for 2-cm cubic samples were treated on a vibration table and thereafter covered with foil until demolding after 23 ± 1 h. The demolded cubes were stored at 20°C and 100% RH.

Summary of design space-features and target property
Based on these three data sets with a total of 131 samples, the DS was constructed. The 22 features used in this study included, for the powdered precursor, the 13 parameters of the chemical composition in wt.% (SiO 2 , Al 2 O 3 , Fe 2 O 3 , MnO, MgO, CaO, Na 2 O, K 2 O, TiO 2 , P 2 O 5 , SO 3 , Cl and loss on ignition). For characterizing the granulometry, three features were used, namely the Blaine fineness [cm 2 /g] as well as the d 90 and d 50 particle sizes [µm]. The composition of the alkaline solution is given with four features in terms of SiO 2 mol.%, Na 2 O or K 2 O mol.% and H 2 O mol.%. Furthermore, the curing temperature and the solution-powder ratio were used. All features were normalized so that their respective mean was zero and their standard deviation was one.
The resultant compressive strength [MPa] of 2 × 2 × 2 cm 3 samples at the age of 28 days is the sought-after target property.
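The feature normalization described above (zero mean, unit standard deviation) can be sketched as follows. This is a minimal illustration in Python; the use of the population standard deviation, the handling of constant columns and the feature values are assumptions, and the function name `standardize_columns` is hypothetical.

```python
from statistics import fmean, pstdev

def standardize_columns(rows):
    """Scale every feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std (assumption)
    # A constant column carries no information and is mapped to 0.0
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical excerpt: SiO2 [wt.%], Blaine fineness [cm2/g], curing temperature [°C]
features = [
    [52.0, 6700.0, 22.0],
    [61.5, 3090.0, 25.0],
    [36.0, 4500.0, 20.0],
]
scaled = standardize_columns(features)
for row in scaled:
    print([round(v, 3) for v in row])
```

Scaling of this kind keeps features with large numerical ranges, such as the Blaine fineness, from dominating distance-based quantities in the design space.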

Investigated scenarios
One way to visualize the SL task is to represent the DS in t-SNE coordinates. t-SNE [69] is a dimension reduction algorithm much like the well-known principal component analysis (PCA). Such algorithms make it possible to represent higher dimensional vector spaces (like our DS) in a lower dimensional form (e.g., in two dimensions). The dimension-reduced space no longer has a physical meaning, but it allows the characteristic distribution of the data points to be analyzed, i.e., points that are close to each other have similar feature values. Trends in the distribution of the data, such as the relationship between feature values and material properties, can be inferred (with some uncertainty). In contrast with PCA, t-SNE employs nonlinear dimensionality reduction. This has the advantage of preserving the relationships between neighboring data points while reducing the large interdimensional distances that occur especially in sparsely populated high-dimensional spaces. Figure 3 shows the distribution of our data set in t-SNE coordinates: on the left, subsets 1-3 from Sect. 3.1.1 to 3.1.3 are represented by different colors; on the right, the distribution of compressive strength is shown as a color scale.
The clear separation of the data on the left side indicates that the difference in material composition is well represented by the features. The global optimum of the strength distribution is found in the third data set (cf. pink dot in Fig. 3, right). The strength distribution within the three materials appears uniform with relatively clear trends toward the respective local maxima, suggesting a good relationship between features and material properties. We compared SL's ability to find the higher-strength materials in various scenarios (see Table 2).
First, combinations of the three prediction algorithms with four utility functions were investigated.
Second, the complexity of the SL task was altered, with SL performed both in a common DS containing all 131 data points and in a segmented DS. The segmentation breaks the complex optimization problem into separate, expectedly simpler problems, potentially creating better predictability and possibly further speed-up. The segmentation was achieved with k-means clustering (with k = 3) of the data in two-dimensional t-SNE coordinates, using the respective standard functions in MATLAB [70,71]. This created clusters that correspond to the three data sources (cf. Fig. 3, left). In addition, the segmentation of the problem allows parallelization, since the SL runs can be executed in batches consisting of the segments.
Third, the initial training set size was varied between four and twelve. The minimum initial training set size of four results from two sub-samplings in the TE with uncertainties algorithm (each leaving out one candidate) and a minimum of two points for a sensible regression problem. The upper bound of the initial training set size was set to 12 to allow 30 different samples to be drawn from the lower half (as per the 50% quantile restriction discussed below) of the smallest data set segment.
Fourth, the task of the SL has been varied in terms of the success threshold between finding the absolute maximum within one DS and finding one of the above 90% quantile compressive strengths within one DS.
Fifth, the benchmarking was carried out statistically in terms of a 50% (Figures 4 and 6) and a 97% (Figures 5 and 7) success rate. The former is the usual benchmark indicating the average performance, and the latter indicates the robustness, which has special importance in scientific studies. The success rates were estimated from 30 SL runs with randomly sampled initial training sets. These initial samples were restricted to the lower half (50% quantile) of the strengths to enforce a significant improvement in the discovery process. Since extrapolation is much more difficult than interpolation for many learning algorithms, it is hoped that this will provide a more realistic understanding of SL performance. It is acknowledged that this restriction has a considerable influence on the performance of SL. However, in practice, whether the increased performance would be achieved remains largely unknown and depends on the given candidate group (see further details in [47]).
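Two steps of the scenario setup described above, the segmentation of the DS with k-means (k = 3) in two-dimensional coordinates and the restriction of the initial training set to the lower half of the strengths, can be sketched in Python as follows. The study used the standard MATLAB functions; this pure-Python version only illustrates the idea. The deterministic farthest-point initialization, the function names, the coordinates and the strengths are assumptions made for this sketch.

```python
import random
from statistics import median

def _sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans_2d(points, k=3, max_iter=100):
    """Lloyd's k-means on 2-D points; returns one cluster label per point."""
    # Deterministic farthest-point initialization (simplification for this sketch)
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(_sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: _sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster happens to be empty
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

def sample_initial_training_set(strengths, size, rng):
    """Draw `size` distinct candidate indices whose strength lies in the lower half."""
    cutoff = median(strengths)
    lower_half = [i for i, s in enumerate(strengths) if s <= cutoff]
    return rng.sample(lower_half, size)

# Hypothetical 2-D embedding coordinates and strengths [MPa]
coords = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
          (0.0, 10.0), (0.1, 10.0), (0.0, 10.1)]
strengths = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0]

labels = kmeans_2d(coords, k=3)
initial = sample_initial_training_set(strengths, 4, random.Random(0))
print(labels, initial)
```

Repeating the sampling step with different random seeds would yield the randomly sampled initial training sets from which the success rates are estimated.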

Results and discussion
The investigated scenarios comprise the twelve SL algorithms, each applied in two different DS and initialized with three different initial training set sizes to achieve two different target thresholds; each combination is statistically analyzed in terms of a 50% and a 97% success rate. This results in a total of 288 SL results, which are shown in Figs. 4, 5, 6 and 7. The second to last line in each figure shows the average performance of the SL algorithms, and the last line … comparable range of numbers. This has practical relevance because, as mentioned in the introduction, the former can be parallelized and thus be quicker in practice. Generally, it seems to pay to start SL as early as possible with few initial training data. The higher cost of collecting a larger data set in the initial phase is sometimes accompanied by even longer run times in terms of more experiments needed in the SL run. This demonstrates how important it can be for the success of SL to explore the DS quickly and that a larger set of weak initial assumptions can cause a confirmation bias that negatively affects the course of SL.
The higher performance of SL over RP is more likely at higher success rates (see Figs. 5 and 7), which speaks for the higher robustness of SL. This is practically relevant because it shows that, depending on the success rate requirements, approaches with higher robustness perform better than approaches with the more commonly compared average performance. When comparing RP, it seems that in Figs. 4 and 5, RP is a benign benchmark with relatively weak performance and low robustness and that the lower target threshold in Figs. 6 and 7 has a strong favorable impact on performance and robustness-making RP an ever more challenging benchmark.
In contrast, lowering the target threshold improves the average SL performance only slightly. Since SL requires training data, it must have a performance offset (in terms of the minimum required experiments) to its disadvantage at the beginning; the more initial training data, the greater the offset. RP, however, can just be lucky with the first draw. SL's only chance of compensating for this disadvantage is to succeed quickly and repeatedly. This in itself is a remarkably difficult task since the higher-strength candidates are in different regions of the high-dimensional DS than the initial training data (cf. Figure 3, right); this means that SL has to extrapolate considerably. The MEI + D and MLI + D utility functions accelerate SL's DS exploration to the extent that makes it possible to outperform RP even for a lower target threshold. This is observed in Fig. 7 for the joint DS with an initial training set size of four for DT with MLI + D as well as for TE with MEI + D and MLI + D. Adding the distance to the utility function essentially extends the exploratory component of the algorithms. However, this does not seem to work for GPR. This can be explained by its tendency to revert predictions to the function mean for more distant points. This means that candidates that have a greater distance (which is exploited by MEI + D and MLI + D) will get lower prediction values. However, since only the 10% highest predictions are considered, candidates with smaller distances are more likely to be chosen here, resulting in less exploration. The exception is the segmented DS, where the distances between the points are smaller. Here, MEI + D and MLI + D improve the robustness of GPR in the low target threshold scenario (cf. Figure 7). The performance of GPR is clearly best when combined with the MLI strategy. DT and TE, on the other hand, benefit from MLI over MEI, especially in the larger joint DS.
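The interplay between prediction and distance discussed above can be made concrete with a small Python sketch of a distance-augmented selection rule. It assumes that only the candidates with the 10% highest predictions are shortlisted and that, among these, the candidate farthest from the existing training data (minimum Euclidean distance) is chosen; the exact definition of MEI + D and MLI + D in the study may differ. The function name `select_next_candidate` and all values are hypothetical.

```python
import math

def select_next_candidate(predictions, candidates, training, top_fraction=0.10):
    """Index of the next experiment: most distant candidate among the top predictions."""
    # Shortlist size: the top fraction of predictions, at least one candidate
    n_top = max(1, math.ceil(round(top_fraction * len(predictions), 9)))
    ranked = sorted(range(len(predictions)), key=lambda i: predictions[i], reverse=True)
    shortlist = ranked[:n_top]

    def novelty(i):
        # Distance to the closest training point in (normalized) feature space
        return min(math.dist(candidates[i], t) for t in training)

    return max(shortlist, key=novelty)

# Hypothetical example: 20 candidates with two features each
candidates = [(0.1, 0.0), (5.0, 5.0)] + [(100.0, 100.0)] * 18
predictions = [50.0, 49.0] + [10.0] * 18   # predicted strengths [MPa]
training = [(0.0, 0.0)]                    # features of already tested materials

print(select_next_candidate(predictions, candidates, training))        # explorative choice
print(select_next_candidate(predictions, candidates, training, 0.05))  # shortlist of one
```

In this example the rule prefers the slightly lower prediction that lies farther from the training data. If a model assigns low predictions to all distant candidates, as described for GPR, those candidates never enter the shortlist and the distance term has little effect.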
Looking at the results, the uncertainty-based and exploratory approaches tend to be more successful in larger DS, as would be expected. Adding more training data also improves DT and TE performance in some cases. However, in the lower success threshold scenario, this is not enough to match RP. In addition, MEI + D and MLI + D perform poorly when more initial training data are added. The overall best approach is consistently TE with MEI + D or MLI + D with an initial training set size of four in the joint DS. Figure 8 illustrates how the relationship between model complexity and model efficiency (note the non-linear scaling of the Y-axis) affects feasibility for laboratory practice. Here the characterization of AABs, on the one hand, involves considering many parameters that potentially vary widely between candidate materials (as discussed in the literature review Sect. 2.1). On the other hand, collecting data is time-consuming and usually must be repeated for each new material. Both aspects limit the applicability of current ML models in practice. Although the specific course of the feasibility boundary depends on the respective laboratory capacity and the complexity of the underlying material, it is clear that the presented SL approach (blue cross) is much more effective than the state-of-the-art ML models (black dots).

Conclusions
Considering the pressing climate crisis, the large CO 2 footprint of OPCs and the complexity of new binders, new materials discovery methods are urgently needed. The amount of available data is one of the major bottlenecks in using ML for this. Existing approaches require large amounts of data and potentially generalize poorly due to their coarse input parameters. We argue for including much more detailed information in ML models and showed how useful material models can still be generated with very few data points (just four to get started). To this end, we applied an SL approach to an AAB data set and analyzed some of the issues that arise for use in laboratory practice.
The results show that the discovery of new precursors for construction binders can be accelerated using SL. Specifically, the acceleration is achieved in terms of fewer experiments required to predict materials with the desired properties. Figure 8 shows the tremendous improvement compared to state-of-the-art ML models from the literature. To achieve this result, SL's purely statistically motivated benchmarking approaches were adapted to aspects of laboratory practice, e.g., by lowering the target threshold and comparing results at higher success rates. From this, the following conclusions can be drawn for SL in laboratory practice. Despite very complex data, it seems promising to integrate SL as early as possible in the experimental program. It has been shown that ML-based SL has the potential to outperform classical GPR. The parallelization of SL runs in three batches had a relatively small average influence on the performance in the example presented. Regarding high-throughput experiments, such parallelization procedures are vital research topics. Lowering the target threshold led to a surprisingly small boost of most SL procedures and a significant acceleration of RP. Even if these considerations are of a rather theoretical nature, because the actual distribution of the properties is unknown in practice, they still show the advantages that RP can have. Incorporating randomized effects, i.e., effects that have nothing to do with the deterministic understanding of the respective ML prediction, could be a key to further acceleration for SL. This has been approached with some success in this publication by adding a distance in the DS to the utility function. TE with MEI + D and MLI + D achieved an outstanding performance with only eleven required experiments to find higher-strength cements with a 97% success rate (cf. Figure 7, right).
In fact, this is nearly 60 times less training data than the models in Table 1 require on average, while more than three times as many material composition features were used. This approach seems to break with the paradigm in data science that more training data produces better models. This is quite significant because it lowers the threshold of adoption for laboratory practice. The work presented in this paper leads us to the conclusion that further research in this field is needed as there are still many open questions-e.g., on the influence of the DS complexity (in terms of very large DSs) or the exact interaction between the amount of initial training data and the success of the approach. It would also be interesting to undertake a systematic study of the added hyper-parameters of the utility functions with distance.
In addition, the relevance of input features for SL is an important question for future research. On the one hand, adding more features can help to better characterize the material, which can lead to even better predictability and faster development cycles. The same is true for omitting superfluous data. On the other hand, some information is currently unusable because it is either not uniformly available or features are simply not easily extracted (e.g., X-ray diffraction data). In turn, the use of such information increases the complexity of data analysis and acquisition, which can lead to longer development cycles. Furthermore, a deeper understanding of the crucial parameters would help to check SL for plausibility. After all, confidence in the method is one of the main prerequisites for its application in practice. To answer the question of how SL methods are to be configured in a wide range of tasks in a variety of scenarios, meta-learning (e.g., as proposed in [72]) could be considered. To make the required highly complex data available, a powerful (possibly semantic) data management is needed that goes far beyond the already extensive data collection by Xie et al. [46].
While the objective of this study is to find alkali-activated binders with high compressive strength, the method can potentially be applied to other properties of building materials, such as durability or workability. The potential of the investigated methods goes beyond material discovery and could soon enable cost-effective and ecological re-engineering of existing binders or even "materials by design" applications in construction.