Introduction

Tannic acid (TA) is an abundant and versatile bio-based material, which readily affords synthetic pathways for the isolation of its elementary building blocks. TA contains many hydroxyl groups, allowing it to form complexes with different macromolecules via hydrogen-bonding, hydrophobic and cation-π interactions.1,2 Abundant hydroxyl groups make TA highly soluble and stable in aqueous solutions. In alkaline conditions, TA undergoes oxidation3,4 and produces oxidized tannic acid (OTA) followed by oligomerization. Concomitant oligomerization of OTA leads to the formation of compounds with higher molecular weight and thereby decreases the solubility of the substance.4 In this form, OTA can interact with different molecules and serve as coatings,3,5 surface modifiers1,6 and emulsion stabilizers,1,3,6,7 or act as stabilizing and reducing agents to aid in inorganic nanoparticle growth,8,9,10 all the while imparting beneficial biological functionality.11,12 For instance, tannic acid has recently been shown to suppress SARS-CoV-2 as a dual inhibitor of the viral main protease.13 All these favorable aspects of OTA and other phenolic particles have fueled research into a wide spectrum of applications.14

OTA can also be crystallized into particles with structural properties that are highly sensitive to the synthesis conditions. Previously, Bhangu et al.10 developed a sonochemical method to chemically transform amorphous tannic acid into nano-/micro-sized crystalline particles without the use of reagents or organic solvents. They obtained OTA particles of different size and shape by simply varying ultrasonic parameters. Kämäräinen et al.4 further presented a facile and scalable protocol to prepare OTA of varying morphologies by altering the TA oxidation conditions. The dimensions, shapes, and the yield of these crystalline particles were highly sensitive to initial TA concentration, reaction time, initial pH, and pKa of the base.

While OTA particulate constructs can facilitate a range of new applications, particle morphology is a key consideration. In many high surface area systems that incorporate particulate matter, particle morphology and size are major contributors to their overall performance through, for example, relationships between morphology and packing,15 percolation,16 rheology,17 and bioactivity.18 Consequently, morphology plays an important role in many applications ranging from heterogeneous catalysts19 and electrochemical cells20,21 to drug delivery systems.22

In this work, we employ machine learning (ML) to explore the morphology landscape of OTA particles in the chemical design space of processing conditions. As illustrated in Figure 1, we start with OTA synthesis experiments and digitalize them into data points for particle morphology. We apply Gaussian process regression (GPR),23 an ML tool for supervised learning, to compute a surrogate model for OTA morphology. Based on the morphology model in the design space of material fabrication, we consider which particle shapes are available and learn how to tune the processing conditions to achieve an optimal outcome for a targeted application.

Figure 1

Workflow for machine learning (ML)-guided morphology control of synthesized oxidized tannic acid particles.

GPR has been employed in materials science for experimental materials design,24,25,26,27,28,29,30 often in combination with Bayesian optimization.31,32,33 Given data within the phase space of N design parameters, GPR produces the statistically most likely N-dimensional landscape, which serves as a surrogate model of a target property.23 Gaussian processes (GPs) interpolate data well, allowing us to build good-quality surrogate models with relatively few data points. They produce smooth and continuous landscapes that reflect the continuous chemical process underpinning the data, and can account for experimental uncertainties as data noise. All of these characteristics make GPR well suited to experimental applications.

The previous study of OTA particle fabrication employed principal component analysis34,35 (PCA, an ML tool for unsupervised learning) on experimental data to ascertain that pH and pKa used in the OTA solution correlate most strongly with particle shape. We proceed to consider OTA morphology in the two-dimensional (2D) search space of pH and pKa. Sample characterization was performed by scanning electron microscopy (SEM) imaging. To digitalize the particle shape information, we quantified the physical dimensions allowed by OTA simple crystalline habits and took note of experimental uncertainties.

While PCA is a versatile tool, it was unable to offer further insight into morphology types, nor indicate optimal processing conditions. Conversely, GPR allowed us to visualize OTA particle morphology as a function of pH and pKa and delivered a chemically interpretable model. Based on the morphology landscape, our objective was to drive the morphology of particles from one-dimensional (1D) to three-dimensional (3D) shapes. Moreover, by extracting particle yield and volume from each experiment we were able to generate surrogate models for multiple experimental properties at no further cost, allowing us to pursue multi-target tuning of OTA particulate structures. In this article, we present the entire workflow necessary to carry out supervised ML applications on experimental data, with the aim of motivating similar work in the community. Data-efficient ML tools from computer science have the potential to renew experimental practices in materials engineering and boost the search for advanced sustainable compounds.

Materials and methods

OTA particle synthesis

OTA particles were synthesized using the protocol reported previously.4 Briefly, aqueous tannic acid solution (2% w/v) was prepared by adding tannic acid powder (1701.20 g/mol, Sigma-Aldrich) into Milli-Q water and vigorously stirring (magnetic bar) until completely dissolved. The pH of the solution was adjusted to a desired pH value with either 1 M KOH, 45% (CH3)3N, 1 M NaOH, 0.5 M Na3PO4 or 25% NH4OH (see Figure 2a). Because the OTA synthesis reaction requires alkaline conditions to proceed, we selected pH > 7, but the base choice was varied widely and resulted in a pKa in the range [9.25, 14.9]. All chemicals were reagent-grade and purchased from Sigma-Aldrich. Solutions were covered with perforated Parafilm and were shaken continuously with an orbital shaker for 14 h. All reactions were carried out at room temperature. The grown and precipitated OTA particles were collected and stored at room temperature for further characterization. Despite the simplicity of particle fabrication, multiple experiments were needed to accurately define the conditions that resulted in a given morphology. This required arduous experimentation, as well as time, since each setup produced a specific morphology depending on the reaction conditions.

Figure 2

Schematic illustration of the experimental protocol used for data acquisition: (a) oxidized tannic acid colloidal particle synthesis; (b) scanning electron microscope image analysis and particle dimension according to characteristic lengths d1, d2, and d3.

SEM image analysis

The synthesized OTA particles were imaged using a field-emission SEM (Sigma VP, Zeiss, Germany) with Schottky emitter at 1.5 kV without stage bias. For this purpose, aqueous suspensions of the OTA particles were cast onto pre-cleaned silicon wafers, dried in ambient laboratory conditions and sputter-coated with 4 nm Pd/Au. All imaging was performed on the same day with the OTA suspensions freshly prepared. Collected SEM images were then analyzed using ImageJ software36 to measure the dimensions of the particles, which typically numbered in the tens (Figure 2b). We measured the length, width and height of OTA particles as d1, d2, and d3, such that d1 ≥ d2 ≥ d3. The average values are reported here as the best estimate of particle dimensions. Standard deviations were recorded to estimate the experimental uncertainty on particle dimensions. All data points, error analysis, and the SEM images are presented in the Supplementary Material document.

Gaussian process regression algorithm

GPR is a kernel-based algorithm for supervised regression that relies on GP models to represent black box functions.23 Given data and the GP prior, Bayes’ rule is applied to compute the GP posterior. The GP posterior mean serves as the surrogate model, the statistically most likely form of the unknown function. The GP posterior variance supplies a local measure of confidence in the model, typically rising in regions of search space where data are scarce and decreasing in well-explored regions.

For GPR fitting we used an uninformative zero mean GP prior and the radial basis function (RBF) kernel to obtain smooth and continuous landscapes. Data noise was Gaussian-distributed with zero mean. To make the model more robust, we applied inverse gamma priors on the hyperparameters of the kernel, the length scale and variance. During regression, the two hyperparameters were fitted in an automated way by maximizing marginal likelihood: this standard GPR procedure ensures that the results do not depend on manual hyperparameter choices.23
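
The ingredients listed above (zero-mean prior, RBF kernel, Gaussian noise, marginal likelihood) can be illustrated with a minimal pure-Python sketch. This is not the BOSS implementation: the function names are hypothetical, a single scalar noise variance is assumed, the hyperparameters are passed in by hand rather than optimized, and the inverse gamma priors are omitted.

```python
import math

def rbf_kernel(a, b, variance, length_scale):
    """Squared-exponential (RBF) covariance between two design points."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return variance * math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def solve_lower(lower, rhs):
    """Forward substitution: solve L z = rhs."""
    z = []
    for i, row in enumerate(lower):
        z.append((rhs[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def solve_lower_transposed(lower, rhs):
    """Back substitution: solve L^T z = rhs."""
    n = len(lower)
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * z[k] for k in range(i + 1, n))
        z[i] = (rhs[i] - s) / lower[i][i]
    return z

def _factorize(x_train, y_train, noise_var, variance, length_scale):
    """Cholesky factor of (K + noise * I) and the weights alpha = (K + noise * I)^-1 y."""
    n = len(x_train)
    cov = [[rbf_kernel(x_train[i], x_train[j], variance, length_scale)
            + (noise_var if i == j else 0.0) for j in range(n)] for i in range(n)]
    lower = cholesky(cov)
    alpha = solve_lower_transposed(lower, solve_lower(lower, y_train))
    return lower, alpha

def gp_posterior(x_train, y_train, x_new, noise_var, variance=1.0, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at a single point x_new."""
    lower, alpha = _factorize(x_train, y_train, noise_var, variance, length_scale)
    k_star = [rbf_kernel(x, x_new, variance, length_scale) for x in x_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(lower, k_star)
    return mean, variance - sum(vi * vi for vi in v)

def log_marginal_likelihood(x_train, y_train, noise_var, variance=1.0, length_scale=1.0):
    """Objective that hyperparameter fitting would maximize (no priors included)."""
    lower, alpha = _factorize(x_train, y_train, noise_var, variance, length_scale)
    data_fit = -0.5 * sum(y * a for y, a in zip(y_train, alpha))
    complexity = -sum(math.log(lower[i][i]) for i in range(len(lower)))
    return data_fit + complexity - 0.5 * len(y_train) * math.log(2.0 * math.pi)
```

In such a sketch, the posterior mean plays the role of the surrogate model, and the posterior variance returns to the prior variance far from any data point.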

To compute the surrogate model, we carried out GPR implemented in the Bayesian Optimization Structure Search (BOSS) code. BOSS is an open-source Python code37,38 for performing GPR and Bayesian optimization (BO) tasks to solve problems in materials science.39,40,41,42 It can read pre-recorded data sets or acquire data on-the-fly with acquisition functions. BOSS post-processing capabilities allowed us to construct surrogate model landscapes and analyze their features.

Results and discussion

We employed 10 experimental data points on crystallized OTA particles collected by Kämäräinen et al.4 to initialize the GPR model. In a departure from earlier work, the prospect of supervised learning required us to carry out experimental data analytics and consider different experimental outcomes, as well as measurement uncertainties. Supervised learning calls for a clear outcome, or label, so samples with ill-defined morphologies were not included in the ML model. Another key part of data digitalization was the conversion of experimental observations into customized descriptors for OTA particle morphology.

We started by analyzing the OTA particle morphology landscapes obtained in the 2D search space of pH and pKa for shape predictions. To test the predictive power of the model, we performed seven more experiments in key regions of the design space. The additional data also served to refine the morphology model. We validated the morphology landscapes against all experimental data collected, including the samples which were not employed in building the model. Finally, we demonstrated how additional property models for particle yield and volume were built from the same set of experiments and considered multi-objective materials design.

Experimental data set

The experimental data set was adapted for GPR supervised learning by presenting each point in [x, y] pair format. Here x is the location in the design space of OTA particle processing conditions, and y is the label, the morphology design objective for which we construct the surrogate model. Depending on the number of design parameters, x can be N-dimensional. In this work, x = (x1, x2) with x1 assigned the pH of the solution and x2 the value of base strength pKa. We limited the design space of the processing conditions (pH, pKa) to the range of ([7.0, 12.2], [9.0, 15.5]) to reflect the range of the processing conditions within which the experiments were realized.

The morphology of particles was quantified from their measured dimensions (d1, d2, d3). To facilitate comparison between data points, the particle dimension data was scaled by the magnitude of the leading dimension (normalizing the longest dimension to 1.0 for each data point). We defined the morphology label y as:

$$ y = \frac{d_{2}}{d_{1}} + \frac{d_{3}}{d_{1}}; \qquad d_{1} = 1.0 \;\Rightarrow\; y = d_{2} + d_{3}. \tag{1} $$

This label allows us to distinguish between 1D and 3D morphology conditions as follows:

$$ y = \begin{cases} 0, & d_{1} \gg d_{2}, d_{3} \quad (\text{1D}) \\ 1, & d_{3} \ll d_{1}, d_{2} \quad (\text{2D}) \\ 2, & d_{1} \cong d_{2} \cong d_{3} \quad (\text{3D}) \end{cases} \tag{2} $$

While the 1D–3D signal difference across the realistic particles may be considerably lower than the ideal [0, 2] range, the choice of a physically meaningful property as label y allowed us to formulate interpretable surrogate models and gain immediate insight from GPR applications.
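
As a minimal sketch of Eq. (1), the label can be computed from three measured dimensions as follows. The function name is hypothetical, and sorting the inputs is an added convenience so that the longest dimension always plays the role of d1.

```python
def morphology_label(d1, d2, d3):
    """Morphology label y = d2/d1 + d3/d1, with d1 taken as the longest dimension."""
    longest, middle, shortest = sorted((d1, d2, d3), reverse=True)
    if shortest <= 0:
        raise ValueError("particle dimensions must be positive")
    return middle / longest + shortest / longest
```

A needle with dimensions (100, 1, 1) gives a label close to 0, a thin square platelet such as (10, 10, 0.1) gives a label close to 1, and a cube gives exactly 2, matching the limiting cases of Eq. (2).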

Next, we review the range of experimental outcomes and discuss their suitability as input for ML application. Unlike in computational research where a numerical result is guaranteed, any experimental data point may result in one of the following outcomes of experimental fabrication, illustrated in Figure 3: (a) no particle precipitate; (b) non-quantifiable, ill-defined particle morphology; (c) good quality precipitates with quantifiable dimensions; and (d) multi-morphology precipitates. Too many experimental observations in the first two categories would suggest that the chosen design variables are not the key drivers of the material synthesis, and that the experimental design space needs further consideration.

Figure 3

Scanning electron microscope images of precipitated oxidized tannic acid particles: (a) no precipitate; (b) ill-defined morphologies; (c) regular morphology, suitable for characterization; (d) dual morphology.

In our work, 74% of experiments (17 points) resulted in quantifiable sample morphology. A further 22% (5 data points) featuring ill-defined particle morphology could not be employed in building the model, but served to verify the model predictions. In one case, we observed OTA samples that featured two distinct particle morphologies in comparable yields (Figure 3d). Such a case indicates a saddle-point in the chemical design space, a two-phase region where both morphologies are in coexistence, and should be approached with caution. Here, we characterized the two morphologies and computed their arithmetic average label y: such treatment reflected the dichotomy in the design space and supplied this information to the model.

Experimental uncertainties are common in any practical work, and must be carefully considered. In our study, there were uncertainties associated with both OTA sample fabrication and characterization. While we made every effort to fix all aspects of OTA particle synthesis apart from pH and pKa, unaccounted differences in ambient conditions such as relative humidity could influence the evaporation rate during the experiments, affecting particle yields and morphologies. Changes in impurity content could also affect the observed morphologies. OTA particle dimensions were measured based on visual assignment of particle boundaries: these may introduce minor uncertainties into the mapping from design space to experimental outcome that are difficult to quantify. Irregular particle sizes in our experiments allowed us to perform a statistical analysis of particle dimensions (and thus morphologies). The standard deviations per particle dimension were combined to compute the overall uncertainty ∆ on the morphology label y. Since this quantity reflects the knowability of data, it was adopted to represent all sources of experimental error and served as data noise in the GPR surrogate model (see Supplementary Material for full details). For the precipitate yield, a conservative estimate of 5% variation was assumed.
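
One common way to combine per-dimension standard deviations into an uncertainty on y is first-order error propagation. The sketch below assumes uncorrelated errors and unnormalized dimensions; it is an illustration only, since the exact procedure used in this work is given in the Supplementary Material, and the function name is hypothetical.

```python
import math

def label_uncertainty(dims, sigmas):
    """First-order uncertainty on y = (d2 + d3) / d1, assuming uncorrelated errors.

    dims   = (d1, d2, d3): mean particle dimensions, d1 the longest
    sigmas = (s1, s2, s3): their standard deviations
    """
    d1, d2, d3 = dims
    s1, s2, s3 = sigmas
    y = (d2 + d3) / d1
    # dy/dd2 = dy/dd3 = 1/d1, and dy/dd1 = -y/d1
    numerator_term = (s2 ** 2 + s3 ** 2) / d1 ** 2
    denominator_term = (y * s1 / d1) ** 2
    return math.sqrt(numerator_term + denominator_term)
```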

Morphology landscapes in the design space

Based on GPR, we computed the initial surrogate model for OTA particle morphology in the 2D pH-pKa design space shown in Figure 4a. The continuous morphology landscape features areas of interest associated with low y signal (1D) and high y signal (3D) structures. It also indicates that there are regions of design space where no data have been collected and where the model may be less reliable.

Figure 4

Gaussian process regression surrogate models for morphology label y in pH-pKa design space fitted with (a) 10 and (b) 17 experimental data points. Chart color reflects the value of the morphology label y, with yellow color denoting 3D and dark blue reflecting 1D particle outcome. Magenta circles indicate the loci of the actual experimental data. The red star indicates the processing conditions that produced oxidized tannic acid particles with the most pronounced 1D character (minimum y value in the surrogate model).

The minimum of the surrogate model in Figure 4a suggests that high-pH combined with high-pKa produced OTA particles with the most strongly pronounced 1D character (\(d_{1} \gg d_{2} , d_{3}\)). Conversely, low pH solutions most likely produced 3D particles. To verify these predictions, we sampled further data points at the edges of the design space at pH < 7.8 and pH > 11, and also at low pKa values, where data had been sparse. The GPR model that was re-trained with 7 additional experimental points is presented in Figure 4b.

The refined surrogate model for OTA particle morphology retains many of the features of the previous GPR fit in Figure 4a. The predicted high-pH and high-pKa conditions for 1D particles remain unchanged. However, the region specific to 3D structures (high y values) is now enhanced, shifting to lower pKa values. The refined landscape suggests that only low-pH and low-pKa processing conditions give rise to 3D particles. The relatively low value of the morphology signal y throughout the design space indicates that many experimental outcomes are 1D-like. Particles that are 2D-like may form only in the region of chemical space that neighbors the 3D structural conditions.

Model validation and predictive power

To extract predictions from the surrogate model, we coarse-grained the landscape into several categories assuming linear progression from 1D to 3D. As illustrated in Figure 5, this allows us to define regions of design space where experiments would reliably produce 1D, 2D, and 3D OTA particles. We observe that 1D and 3D regions of design space are clear and well separated. The model predicts that solution pH and pKa are directly correlated: 1D particles are obtained when their values are both high, and 3D when they are both low. In contrast, the 2D particle region spans a limited non-convex area in design space that conforms to the 3D particle region. This implies that 2D particles are difficult to synthesize. The greatest portion of design space was associated with 1D-type structures. The resulting model prediction is that when pH and pKa are inversely correlated, 1D-like or 1D-2D mixed morphology particles are expected to occur.
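
The coarse-graining step can be sketched as an equal-width binning of the label range. The five category names and the choice of bin edges (equal intervals between a chosen minimum and maximum of y) are assumptions made for illustration; the thresholds actually used for Figure 5 are not stated here.

```python
CATEGORIES = ("1D", "1D-2D", "2D", "2D-3D", "3D")

def classify_morphology(y, y_min, y_max, categories=CATEGORIES):
    """Map a morphology label onto equal-width categories between y_min and y_max."""
    if y_max <= y_min:
        raise ValueError("y_max must exceed y_min")
    fraction = (y - y_min) / (y_max - y_min)
    index = int(fraction * len(categories))
    # Clamp so that values at or beyond the ends fall into the outermost categories.
    index = min(max(index, 0), len(categories) - 1)
    return categories[index]
```

Applying such a function to every grid point of the surrogate model would yield a categorical map of the design space.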

Figure 5

Oxidized tannic acid particle morphology prediction by particle dimensionality, indicating mixed 1D–2D/2D–3D regions.

In the next step, we validate our model predictions by cross-referencing SEM images of OTA particles with the particle morphology landscape. Figure 6 portrays the landscape overlaid with SEM image data from the area of design space where the OTA particle synthesis was carried out. Images outlined in red represent cases of non-quantifiable particle dimensions (ill-defined morphology), which were not included in the model construction. The case of dual particle morphologies is indicated in green.

Figure 6

Surrogate model of oxidized tannic acid morphology validated against experimental scanning electron microscope images. Images with red borders indicate experiments with ill-defined morphology while those in green indicate mixed morphologies.

It is immediately clear that the predictions regarding 1D and 3D particle formation were correct. 1D landscape regions are associated with very long needle-like particles (up to 0.1 mm), where the design condition \(d_{1} \gg d_{2} , d_{3}\) is best satisfied. 1D-like regions exhibit a different 1D morphology where the particles are short and matchstick-like. In some cases, the short 1D particles agglomerate into a larger mass where the morphology is not easily identified. These data were not included in the surrogate model, and yet they correlate well with the mixed morphology 1D–2D and 2D–3D regions of the landscape. The same is true of the dual morphology data points, which correctly occur in the mixed 1D–2D section of the landscape.

SEM images reveal few examples of 3D particles obtained in these experiments, about 25% of the total. Even fewer are the 2D particle cases, which present mostly as domino-like platelet structures. As predicted by the surrogate model, 1D particles dominate the design space: short matchstick-like structures are the most common experimental outcome. At intermediate pH and pKa values, there is a risk of particle aggregation: matchsticks combining into disordered bundles and coral-like growth are observed.

OTA particle yield and functional properties

Having demonstrated that GPR surrogate models for OTA particle morphology have good predictive power, we turn our attention to other experimental information. With each data point, we recorded the yield of the dried OTA colloidal content. The measurement of particle dimensions further allowed us to analyze and engineer other functional properties such as particle size, volume or surface area. The leading particle length in experiments varied in the range 0.4–130 µm, suggesting that experimental conditions can be used to tailor the particle size to diverse applications. We focused on the ratio of particle surface area to its volume: surface-based chemical processes underpin many technological applications, so maximizing surface area per volume (A/V) complements particle morphology control as an important design objective.
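
For illustration, the A/V ratio can be estimated from the three measured dimensions by treating each particle as a rectangular cuboid. The cuboid geometry is an assumption of this sketch (the shape model used in this work is not specified here), and the function name is hypothetical.

```python
def area_to_volume(d1, d2, d3):
    """Surface-area-to-volume ratio of a cuboid with edge lengths d1, d2, d3."""
    if min(d1, d2, d3) <= 0:
        raise ValueError("particle dimensions must be positive")
    area = 2.0 * (d1 * d2 + d1 * d3 + d2 * d3)
    volume = d1 * d2 * d3
    return area / volume
```

Because the ratio has units of inverse length, it depends on absolute particle size as well as on shape, so it must be computed from the measured dimensions rather than from the normalized ones.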

The GPR surrogate model for OTA particle yield is presented in Figure 7a. The irregular features in this landscape suggest that particle yield is strongly correlated with the base employed in the solution, rather than the pKa value. For example, applying LiOH (pKa 13.8) to OTA leads to relatively high yields, about 60%, but NaOH (pKa 14.8) causes the yield to drop below 10%. This observation suggests that particle yield may be better correlated with a different property of the base, such as its size. Solution pH does play a role in the particle yield, with the largest yields observed in the pH range of 8–11.

Figure 7

Surrogate models for oxidized tannic acid particle experimental properties: (a) particle yield and (b) particle A/V ratio, in the design space of pH and pKa processing conditions. Magenta circles indicate the locations of the experimental data points.

The OTA particle A/V landscape, illustrated in Figure 7b, presents a central region where the A/V ratio is very high. These mid-range pH and pKa conditions are associated with 2D particles, where experimental data are scarce. OTA particles synthesized in these conditions tend to produce 2D-like lamellar forms that agglomerate into 3D structures (see Figure 6 for SEM images). It was difficult to measure the shape of these particles, so they were not included in the surrogate model. Nevertheless, such samples clearly had the highest A/V ratio, and this was correctly predicted by the A/V surrogate model despite the paucity of data.

Extracting several surrogate models from the same experimental data (at no additional cost) allows us to cross-reference different properties and infer the conditions that would satisfy several design objectives at once. For example, a high yield of 3D particles can be obtained with NH4OH at low pH (pH = 7). The highest yield of 1D OTA particles can be achieved with KOH at pH = 10–11, which also produces the largest particles with the most surface area exposed. 1D particles with high A/V ratio could be produced at very high pKa, but at relatively low yields. In further work, different label variables can be arithmetically combined into composite labels and landscapes.
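
One possible way to form a composite label is a weighted sum of min-max normalized properties evaluated at the same design points. The sketch below is only an illustration of that idea: the normalization scheme, the weights, and the property names are all assumptions, not a method used in this work.

```python
def normalize(values):
    """Rescale a list of values linearly onto the interval [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

def composite_label(labels, weights):
    """Weighted sum of normalized labels.

    labels  : dict mapping a property name to its values at the same design points
    weights : dict mapping the same names to weights (negative to penalize a property)
    """
    scaled = {name: normalize(labels[name]) for name in weights}
    n_points = len(next(iter(scaled.values())))
    return [sum(weights[name] * scaled[name][i] for name in weights)
            for i in range(n_points)]
```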

Discussion and outlook

The purpose of this work was to evaluate the predictive power of GPR on a small experimental data set; therefore, we deliberately constrained the dimensionality of the problem, which also produced interpretable surrogate models. OTA particle morphology is certainly affected by other experimental parameters. Nevertheless, the good predictive power of surrogate models in the relatively simple 2D design space demonstrated that pH and pKa alone are sufficient to control particle morphology, in agreement with the earlier PCA result. PCA, however, could not provide the insights into morphology variation that the surrogate models delivered.

The morphology landscape portrays a very clear trend, but we are unable to interpret it using scientific intuition. The bottom-up OTA particle synthesis is a result of complex self-assembly where OTA particles coordinate into secondary supramolecular structures, which form tertiary nanofilaments and these assemble into quaternary mesoscopic crystals. It is very difficult to develop any intuition about the outcome of such an intricate procedure, or about how processing conditions might affect it. Instead, the data-driven landscape can guide further research into the chemical processes behind such outcomes and advance fundamental understanding.

Surrogate models are of general value in materials design because they span the entire design space and are chemically intuitive and interpretable. It is difficult to establish the criteria for quantitative accuracy of surrogate models. Our work shows that qualitative accuracy already translates to good predictive power, marked by the good visual agreement between the morphology landscapes and the SEM images. OTA samples with ill-defined morphology (not included in the GPR) were particularly important in validating the model predictions. The correspondence of these mixed morphology samples with the appropriate regions on the map demonstrates that good quality ML predictions can be achieved in areas where no experiments were previously performed or included in the model.

The sensitivity of OTA particles to their processing conditions made them an ideal test case for this study, but they remain a challenging material to work with. The composition as well as the molecular structure of tannins are dependent on the source they were extracted from.43,44 In other words, the plant species and their physiological state dictate the polydispersity and molecular weight, giving rise to inevitable heterogeneity, which complicates the processing and characterization of the materials. The relatively high experimental uncertainties translated into data noise that amounted to 10% of the entire GPR model corrugation. Such noise did not impair the predictive power of the models in this study, but in other work experimental errors could lead to distorted models and less optimal fits.

The convergence of GP models is an important concern in experimental work where data set sizes are small. Typically, an iterative convergence procedure is followed. Here, the addition of seven further data points intended to verify model predictions had a small effect on the qualitative features of the model, so we stopped short of additional experiments. We note that good quality fits can be obtained with small data sets in the case of simple landscapes (few extrema) and very limited problem dimensionality (2D), thus avoiding the curse of dimensionality.45 The need for additional data can also be evaluated from the values of the GPR posterior variance, which tends to decrease with more data included in the model. We considered the OTA morphology model variance after 10 and 17 experimental points (see Figure S3). In this work, the relatively large experimental uncertainties translated into large values of GP variance, which remained unchanged with the addition of more data. This finding indicates that in GPR applications to experimental data, where large noise maintains high variance, GP posterior variance might not be a useful measure of model confidence. However, the variance could be used to guide additional experiments.

In further work, our GPR-based approach could be extended to active learning material design workflows. In BO,32,33 GPR variance is exploited by acquisition functions to select the sampling location that would most enhance the data set. Acquisition functions balance data exploration (searching less-visited areas of phase space) with data exploitation (searching near optimum points in phase space) to attain search objectives with relatively few data points. Search objectives can be learning the entire landscape or minimizing and maximizing materials properties across the search space.
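
As an illustration of how an acquisition function trades exploration against exploitation, the sketch below uses a lower confidence bound for a minimization objective. This is one common choice, named here as an example rather than as the function that a future workflow would use; the `predict` callable and the weight `kappa` are hypothetical.

```python
import math

def lower_confidence_bound(mean, variance, kappa=2.0):
    """Low values arise from a low predicted mean (exploitation) or high uncertainty (exploration)."""
    return mean - kappa * math.sqrt(max(variance, 0.0))

def next_sample(candidates, predict, kappa=2.0):
    """Pick the candidate design point that minimizes the lower confidence bound.

    candidates : iterable of design points, e.g. (pH, pKa) tuples
    predict    : callable returning (posterior mean, posterior variance) at a point
    """
    return min(candidates,
               key=lambda x: lower_confidence_bound(*predict(x), kappa=kappa))
```

With `kappa = 0` the search is purely exploitative and selects the lowest predicted mean; larger values favor regions where the posterior variance is high.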

By demonstrating that GPR performs well with experimental data related to OTA morphology design, this study opens the route toward BO with experimental data in engineering colloids. Integrating BO into experimental work is challenging,46,47,48 but there are many benefits.49,50 With acquisition functions guiding the selection of experiments, good predictive power of machine learning could be achieved with fewer experimental data points, facilitating the study of complex N-dimensional design spaces with more design variables. Moreover, BO allows experimental data collection to be driven towards materials with preferred functional properties (morphological, mechanical or chemical) within the search space. The ML-guided search can thus replace the trial-and-error experimental approach in materials design.

Conclusions

Supramolecular OTA constructs present a prospect of novel applications for this versatile and bioactive material. Controlling particle morphology will help us tailor the OTA particulates toward certain functions and application areas. This study combined materials engineering with GPR supervised machine learning to correlate the processing conditions of OTA colloidal solution with the morphology of the resulting dry OTA particles. The Bayesian surrogate model landscapes revealed the variation of particle morphology in the design space, illustrating the fabrication conditions needed to achieve different particle shapes. The main finding from the OTA morphology landscape is that severe processing conditions (high pH and pKa) give rise to extended 1D particles with high surface area per volume ratios. Reducing the severity of the solution produces smaller, compact 3D shapes.

Despite the relatively small data set size and large experimental uncertainty, the data-driven morphology landscape was in good agreement with OTA sample images. It exhibited considerable predictive power on samples that were not originally included in the model, marking the potential for predictive materials design. From the same set of experiments, we built surrogate models for OTA particle shape, yield, and surface-to-volume ratio, and cross-referenced them to demonstrate how multiple design objectives could be satisfied at once.

Mapping processing conditions directly to experimental properties of materials constitutes a practical approach to ML-led materials engineering, free of human bias. Such procedures could not only supplant experimental trial-and-error approaches, but also guide further research into the mechanisms of crystallization and self-assembly in complex materials, opening innovative engineering routes toward new phases of matter.