1 Introduction

The ongoing revolution in artificial intelligence has the potential to transform society well beyond applications in science and technology. A key ingredient is machine learning (ML), for which increasingly sophisticated methods have been developed, bringing the expectation that within a few decades machines may outperform humans in most tasks, including intellectual tasks. Two converging movements are responsible for this revolution. The first may be referred to as “data-intensive discovery” [1], “e-Science”, or “big data” [3,4], a movement wherein massive amounts of data are transformed into knowledge. This is attained with various computational methods, which increasingly engage ML techniques, in a movement characterized by a transition in which data move from a “passive” to an “active” role. What we mean by an “active” role for data is that data are not considered solely to confirm or refute a hypothesis but also assist in raising new hypotheses to be tested at an unprecedented scale. The transition into a fully active role for data will only be complete when the computational methods (or machines) are capable of generating knowledge themselves. Within this novel paradigm, data must be organized in such a way as to be machine-readable, particularly since computers at present cannot “read” and interpret text. Attempts to teach computers to read are precisely within the realm of the second movement, in which natural language processing tools are under development to process spoken and written text. Significant advances in this regard have been achieved recently by combining ML and big data, as may be appreciated from the astounding progress in speech processing [5] and machine translation. Computational systems are still far from the human ability to interpret text, but the increasingly synergistic use of big data and ML allows one to envisage the creation of intelligent systems that can handle massive amounts of data with analytical ability. Then, beyond the potential to outperform humans, machines would also be able to generate knowledge without human intervention.

Regardless of whether such optimistic predictions will become a reality, big data and ML already have a significant impact owing to the generality of their approaches. To understand why this is happening, we need to distinguish the contributions of the two areas. Working with big data normally requires proper infrastructure, with the major difficulties associated with gathering and curating large amounts of data. In Materials Science, for instance, access to large databases and considerable computational power are essential, as exemplified in this review paper in the discussion of sensor networks. Standard ML algorithms, on the other hand, can operate on small datasets and in many cases require only limited computational resources. Furthermore, there are major limitations in what ML can achieve, owing to fundamental conceptual difficulties.

The goals of ML fall into two distinct types [6]: (i) classification of data instances in a large database, as in image processing and voice recognition; (ii) making inferences based on the organization and/or structuring of the data. Needless to say, the second goal is much harder to achieve. Consider, as an example, the application of ML to identify text authorship. In a classification experiment with tens of English literature books modeled as word networks, book authors were identified with high accuracy using supervised ML [7]. Nonetheless, it would be impossible with current technology to make a detailed analysis of writing style and establish correlations among authors, which would represent a task of the second type. Such a task would require considerable new developments, such as teaching computers to read. Today, the success of ML stems mostly from applications focused on the first goal, which also encompasses most of its uses in materials science. Nonetheless, much more can be expected in the next few decades, as we shall comment upon in our concluding Sect. 5.

Considerable work has been devoted to addressing the challenges that arise when materials science meets big data and ML. The evolution in computing resources allows scientists to produce and manipulate unprecedented data volumes, which must be stored and managed via algorithms with embedded intelligence [8]. Data processing has become a much more complex endeavour than mere storage and retrieval, as concepts such as data curation and provenance come into play, particularly if the data are to be machine-readable. Materials scientists already employ substantial amounts of machine-readable data in at least two major fields: in exploring protein databanks and in crystallography. These are illustrative examples of machine-readable content whose exploitation requires artificial intelligence.

In this review paper, we discuss two areas in Materials Science that are fundamental on their own but also complement each other: ML-based discovery of new materials and ML-based analysis of chemical sensing compounds. The first presents the latest techniques to search the space of possibilities given by molecular interactions; the second reviews analytical methods to understand the properties of materials when used for sensing. They are complementary because new materials can lend sensing capabilities that, in turn, can produce more data to feed algorithmic methods for materials discovery. In addition to the acronym ML, we will use the acronyms AI for artificial intelligence, DL for deep learning, and DNN for deep neural networks, with the latter two used interchangeably.

2 New trends in big data and machine learning relevant to materials sciences

Concepts and methodologies related to big data and ML have been employed to address many problems in materials sciences, as emphasized by the illustrative examples that we will discuss in the next sections. A description of the concepts and myths targeted at chemists and related professionals is given in the review by Richardson et al. [9]. In this section, we briefly introduce such concepts as background to assist the reader in following the paper.

2.1 Big data

The broadly advertised term “big data” has gained attention as a direct consequence of the rapid growth in the amount of data being produced in all fields of human activity. The magnitude of this increase is often highlighted even to the general public, as in a recent news piece by Forbes, which states that “There are 2.5 quintillion bytes of data created each day at our current pace, but that pace is only accelerating with the growth of the Internet of Things (IoT)” [10]. However, big data is not just about massive data production, a perspective that has popularized the term as a buzzword loaded with expectations [3]. Big data also refers to a collection of novel software tools and analytical techniques that can generate real value by identifying and visualizing patterns from dispersed and apparently unconnected data sources. Nevertheless, to grasp the genuine virtues and potential of big data in materials science, its meaning must be interpreted in connection with the specificities of this particular domain. Big data might be understood as a movement driven by technological advances that accelerated data generation to a pace fast enough to outstrip the capacity of resources centralized in a single company or institution. Although this accelerated pace raises many computational problems, it also brings potential benefits, as innovations induced by big data problems will certainly lead to a range of entirely new scientific discoveries that would not be possible otherwise.

In Materials Sciences, big data can be exploited in many ways: in computer simulations, miniaturized sensors, combinatorial synthesis, in the design of experimental procedures and protocols with increased complexity, and in the immediate sharing of experimental results via databases and the Internet, to name a few. In quantum chemistry, for example, the ioChem-BD platform [11] is a tool to manage large volumes of simulation results on chemical structures, bond energies, spin angular momentum, and other descriptive measures. The platform provides inspection tools, including versatile browsing and visualization for a basic level of comprehension, in addition to techniques and tools for systematic analysis. In data-driven medicinal chemistry [4], investigators must face critical issues that scale with the data, such as data sharing, the modelling of molecular behaviour, implementation and validation with experimental rigour, and the definition and identification of ethical considerations.

Big data have often been characterized in terms of the so-called five Vs: volume, velocity, variety, veracity, and value [3, 4]. Although not a rigorous definition, this description captures the characteristic properties of the big data scenario. As far as volume is concerned, size is relative across research fields – what is considered big in Materials Science may be small in computer science; what matters is to what extent the data are manageable and usable by those who need to learn from them. Materials Science produces big data volumes by means of techniques such as parallel synthesis [12], high-throughput screening (HTS) [13], and first-principles calculations, as reported in notable efforts on quantum chemistry [14], on materials at large as in the AFLOW project [15], and on organic molecules as in the ANI-1 project [16]. Big data volumes in materials science also originate from compilations of the literature and patent repositories [17, 18]. Closely related is velocity, which refers to the pace of data generation and may affect the capability of drawing conclusions and identifying alternative experimental directions, demanding off-the-shelf analytical tools to support timely summarization and hypothesis validation. Variety refers to the diversity of data types and formats currently available. While materials science has the advantage of an established universal language to describe compounds and reactions, many problems arise when translating this language into computational models whose usage varies across research groups and even across individual researchers. Veracity in Materials Science is closely concerned with the potential lack of quality in data produced by imprecise simulations or collected from experiments not conforming to a sufficiently rigid protocol, especially when biological organisms are involved. Finally, value refers to the obvious need for data that are trustworthy, precise, and conclusive.

The importance of big data for materials science is highlighted in several initiatives, such as the BIGCHEM project described by Tetko et al. [19]. Their work is illustrative of the issues in handling big data in chemistry and life sciences: it includes a discussion on the importance of data quality, the challenges in visualizing millions of data instances and the use of data mining and ML for predictions in pharmacology. Of particular relevance is the search for suitable strategies to explore billions of molecules, which can be useful in various applications, especially in the pharmaceutical industry, to reduce the massive cost of identifying new lead compounds [19].

2.2 Machine learning (ML)

In computer science, the standard approach is to use programming languages to code algorithms that “teach” the computer to perform a particular task. ML, in turn, refers to implementing algorithms that tell the computer how to “learn”, given a set of data instances (or examples) and some underlying assumptions. Computer programs such as those deployed with DL frameworks can then execute tasks that are not explicitly defined in the code. As the very name implies, ML depends on learning, a process that in humans takes years, even decades, and that often happens through the observation of both successes and failures. It is thus implicit that such learning depends on a large degree of experimental support. In its most usual approach, ML depends on an extensive set of successful and unsuccessful examples that will mold the underlying learning algorithm. This is where ML and big data converge. The abundance of both data and computing capacity has made feasible approaches that would not work otherwise for lack of sufficient examples to learn from and/or processing power to drive the learning process. In fact, specialists argue that data collection and preparation in ML can demand more effort than the actual design of the learning algorithms [20]. Nevertheless, solving these issues to build effective learning programs is worthwhile: computers handle datasets far larger than humans possibly can, are not susceptible to fatigue, and, unless mistakenly programmed, hardly ever make numerical errors.

ML is a useful approach to problems for which designing explicit algorithms is difficult or infeasible, as in the case of spam filtering or detecting meaningful elements in images. Such problems have a huge space of possible solutions; thus, rather than searching for an explicit solution, a more effective strategy is to have the computer progressively learn patterns directly from examples. Many problems in materials science conform to this strategy, including protein structure prediction, virtual screening of host-guest binding behaviour, material design, property prediction, and the derivation of models for quantitative structure-activity relationships (QSARs). The latter is, itself, an ML practice based on classification and regression techniques; it is used, for example, to predict the biological activity of a compound, taking its physicochemical properties as input.
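As a minimal illustration of this classification setting, the sketch below trains a scikit-learn classifier to label hypothetical compounds as active or inactive from a handful of synthetic physicochemical descriptors; the data, descriptor names, and activity rule are placeholders, not taken from any study cited here.

```python
# Minimal QSAR-style sketch (hypothetical data): predict whether a compound is
# "active" from a few physicochemical descriptors using scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Each row is a compound: [molecular weight, logP, polar surface area, H-bond donors]
X = rng.normal(size=(500, 4))
# Synthetic "activity" label, standing in for a measured biological activity
y = (0.8 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.3, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```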

2.3 An overview of deep learning

The latest achievements in ML are related to techniques broadly known as deep learning (DL), achieved with deep neural networks (DNNs) [21], which outperform state-of-the-art algorithms in problems such as image and speech recognition (see Fig. 1a). DL algorithms rely on artificial neural networks (ANNs), a biology-inspired technique whose underlying principle is to approximate complex functions by translating a large number of inputs into a proper output. The principles behind DL are not new, dating back to the introduction of the perceptron in 1958. After decades of disappointing results in the 1980s and 1990s, ANNs were revitalised in 2012 with the impressive innovations in the seminal paper by Krizhevsky et al. [22] and their AlexNet architecture for image classification, inspired by the ideas of LeCun et al. [23]. The driving factors responsible for this drastic change in the profile of a 50-year research field were significant algorithmic advances coupled with huge processing power (thanks to GPU advances [24]), big data sets, and robust development frameworks.

Figure 1 illustrates the evolving popularity of DL and its applications. Figure 1a shows the rate of improvement in the task of image classification after the introduction of DL methods. Figure 1b shows the increasing interest in the topic, reflected in the number of publications indexed by the Institute for Scientific Information (ISI), while the increasing popularity of the major software packages [24], Torch, Theano, Caffe, TensorFlow, and Keras, is demonstrated in Fig. 1c.

Fig. 1
figure 1

Facts about deep learning. a The improvement of the image classification rate after the introduction of the AlexNet deep neural network. b The increasing number of publications as indexed by ISI (Institute for Scientific Information). c The popularity (Google Trends Score) of major deep learning software packages in the current decade: Torch, Theano, Caffe, TensorFlow, and Keras [24]. Elaborated by the authors of this paper

A common method of performing DL is by means of deep feedforward networks or, simply, feedforward neural networks, a kind of multilayer perceptron [25]. They work by approximating a function f*, as in the case of a classifier y = f*(x) that maps an input vector x to a category y. Such a mapping is formally defined as y = f(x;θ), where the network must learn the parameters θ that result in the best function approximation. One can think of the network as a pipeline of interconnected layers of basic processing units, the so-called neurons, which work in parallel; each neuron is a vector-to-scalar function. The model is inspired by neuroscience findings according to which a neuron receives input from many other neurons; each input is multiplied by a weight – the set of all the weights corresponds to the set of parameters θ. After receiving the vector of inputs, the neuron computes its activation value, a process that proceeds layer by layer up to the output layer.
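A minimal sketch of this forward computation is given below in NumPy, assuming a single hidden layer with ReLU activations; the layer sizes and random weights are arbitrary choices for illustration.

```python
# Sketch of the mapping y = f(x; theta) computed by a small feedforward network:
# each layer multiplies its input by a weight matrix (part of the parameters theta),
# adds a bias, and applies a nonlinearity; the last layer produces the output.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Propagate an input vector x through the layers defined by params."""
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)        # hidden layers: weighted sum + activation
    W_out, b_out = params[-1]
    return W_out @ a + b_out       # output layer (e.g., class scores)

rng = np.random.default_rng(0)
# A 4-input network with one hidden layer of 8 neurons and 3 outputs
params = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(3, 8)), np.zeros(3))]
x = rng.normal(size=4)
print(forward(x, params))          # scores for 3 hypothetical categories
```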

Initially, the network does not know the correct weights. To determine them, it uses a set of labelled examples so that every time the classifier y = f*(x) misses the correct class, the weights are adjusted by back-propagating the error. During this back-propagation, a widely used method to adjust the weights is gradient descent, which calculates the derivative of a loss function (e.g., mean squared error) with respect to each weight and subtracts it, scaled by a learning rate, from that weight. The adjustment is repeated over multiple labelled examples and multiple iterations until the desired function is approximated. This whole process is called the training phase, and the abundance of labelled data produced nowadays has drastically expanded the domains in which ML can be applied.
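The sketch below illustrates the weight-update rule with a single linear "neuron" on synthetic data; the learning rate and number of iterations are arbitrary, and a real DNN would back-propagate through many layers rather than the single weight vector used here.

```python
# Gradient-descent sketch on labelled examples: a single linear "neuron" stands
# in for the full network so the weight-update rule stays easy to follow.
# The data are synthetic; the loss is the mean squared error (MSE).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # 200 labelled examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)   # labels the model should learn to reproduce

w = np.zeros(3)                               # initial weights (correct values unknown)
lr = 0.1                                      # learning rate scaling each update

for epoch in range(100):                      # repeated passes over the examples
    y_pred = X @ w
    error = y_pred - y
    grad = 2 * X.T @ error / len(y)           # d(MSE)/dw for each weight
    w -= lr * grad                            # subtract the scaled gradient
print("learned weights:", w)                  # approaches [1.5, -2.0, 0.5]
```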

With an appropriately designed and complex architecture, possibly consisting of dozens of layers and hundreds of neurons, an extensively trained artificial neural network defines a mathematical process whose dynamics are capable of embodying increasingly complex hierarchical concepts. As a result, the networks allow machines to mimic abilities once considered to be exclusive to humans, such as translating text or recognizing objects.

Once the algorithms are trained, they should be able to generalise and provide correct answers for new examples of a similar nature. In such an optimization process, it is important to balance the fit to the training data: if the model overfits, it will map correctly only the training examples; if it underfits, it will miss even previously seen examples. For sensor networks, for example, overfitting produces a very low error on the training set because the model encompasses both the noise and the real signal. This often results in a system that generalizes poorly, as the noise is random. Therefore, a compromise is required, which involves using several different training sets [26] and, most importantly, regularization techniques [25].
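The toy example below, with synthetic noisy data, shows how an over-parameterized fit achieves a very low training error but a larger validation error, and how L2 regularization (here scikit-learn's Ridge) mitigates the effect; the polynomial degree and regularization strength are arbitrary illustrative choices.

```python
# Sketch of the overfitting/underfitting trade-off on synthetic noisy data:
# a high-degree polynomial fit with and without L2 regularization (Ridge),
# compared on a held-out validation split.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * x[:, 0]) + 0.3 * rng.normal(size=60)   # real signal + random noise

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.5, random_state=0)

for name, reg in [("unregularized", LinearRegression()),
                  ("ridge (L2)", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=15), reg).fit(x_tr, y_tr)
    print(name,
          "train MSE:", mean_squared_error(y_tr, model.predict(x_tr)),
          "validation MSE:", mean_squared_error(y_val, model.predict(x_val)))
```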

2.4 The flow path towards data‐based scientific discovery in materials science

Figure 2 illustrates the standard flow path from data production to the outcome of ML. The first step concerns methods to produce sufficient data to feed computational learning methods. Such data must initially be analysed by a domain expert, who will classify, label, validate, or reject the results of an experiment or simulation. This preprocessing step can be laborious and is critical in that, if not performed rigorously, it might invalidate the remainder of the process or compromise its results. Once the data are ready, they are fed to an ML method; such methods require a stage of learning (or training) in which the computer learns from the known results provided by the domain expert. During the training step, knowledge is transferred from the training data to the computer in the form of algorithmic settings that are specific to each method. A model is then learned, i.e., a mathematical abstraction that, when executed computationally, can be applied to new, unseen data to produce accurate classifications/regressions and correct inferences as the final outcomes of the process.

Fig. 2
figure 2

 The standard flow path of ML, from data production to classification or inference. Elaborated by the authors of this paper
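A compact sketch of the flow path in Fig. 2, using scikit-learn on synthetic "expert-labelled" data, could look as follows; the preprocessing step, classifier, and split sizes are illustrative choices, not a prescription.

```python
# Sketch of the flow path of Fig. 2: expert-labelled data are preprocessed,
# a model is trained, and the learned model is applied to new, unseen data.
# Dataset and labels are synthetic placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))                      # raw measurements or simulation outputs
y = (X[:, 0] + X[:, 3] > 0).astype(int)            # labels provided by a domain expert

# Hold out "new, unseen data" to check that the learned model generalizes
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.3, random_state=1)

pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # preprocessing + ML method
pipeline.fit(X_train, y_train)                     # training phase
print(classification_report(y_new, pipeline.predict(X_new)))   # outcome on unseen data
```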

3 Materials discovery

Using computational tools to discover new materials and evaluate material properties is an old endeavour. For instance, DENDRAL, the first documented project on computer-assisted organic synthesis, dates back to 1965 [27]. According to Szymkuc et al. [28], the early enthusiasm observed in the initial attempts waned with successive failures to obtain reasonable predictive power that could be used to plan organic synthesis, so much so that teaching a computer to perform this task was at some point considered a “mission impossible” [28]. However, renewed interest has emerged with the enhanced capability afforded by big data and novel ML approaches, particularly since one may envisage–for the first time–the possibility of exploring a considerable portion of the space of possible solutions defined by the elements in the periodic table and the laws of reactivity. The number of possible material structures in this space is estimated to be 10¹⁰⁰, larger than the number of particles in the universe [29]. In this vastness, the discovery of new materials must face resource and time constraints [30, 31]. Expensive experiments must be well planned, ideally targeting lead structures with a high potential of generating new materials with useful properties. It is in this scenario that ML offers its greatest potential for materials science. Of course, this comes at a cost; these complex spaces must be mathematically modelled, and a significant number of representative patterns must be available as examples for the ML system to learn from. ML classifiers can then be trained to predict material properties, as demonstrated for magnetic materials without requiring first-principles calculations [32].

Predicting material properties from basic physicochemical properties involves exploring quantitative structure-property relationships (QSPRs), analogous versions of QSAR for nonbiological applications [33,34,35]. There are many examples of ML applied in this domain. For example, the reaction outcomes from the crystallization of templated vanadium selenites were predicted with a support vector machine (SVM)-based model where the training set included a “dark” portion of unsuccessful reactions compiled from laboratory notebooks [36]. Briefly, SVM refers to a discriminative classifier formally defined by a separating hyperplane, a widely used technique in ML [8]. The prediction of the target compound was attained with a success rate of 89 %, higher than that obtained with human intuition (78 %) [36]. For inorganic solid-state materials, atom-scale calculations have been a major tool to help understand material behaviour and accelerate material discovery [37]. Some of the relevant properties, however, are only obtained at a very high computational cost, which has stimulated the use of data-based discovery. In addition to highlighting major recent advances, Ward and Wolverton [37] comment upon current limitations in the field, such as the limited availability of appropriate software targeted at the computation of material properties.
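To make the SVM idea concrete, the sketch below trains a scikit-learn SVM on hypothetical reaction descriptors labelled as successful or failed, in the spirit of (but not reproducing) the study in [36]; all features and labels are synthetic.

```python
# Sketch of an SVM classifier for reaction outcomes: a separating hyperplane is
# learned from descriptors of past reactions (random placeholders here)
# labelled as successful (1) or failed (0).
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
# Hypothetical descriptors: reagent ratio, pH, temperature, reaction time
X = rng.normal(size=(400, 4))
y = (1.2 * X[:, 0] - 0.7 * X[:, 2] + rng.normal(scale=0.5, size=400) > 0).astype(int)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)          # estimated success rate of the predictions
print("cross-validated accuracy:", scores.mean())
```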

To outline the contributions in the chemistry literature related to materials discovery, we selected a subset of topics that exemplify the many uses of ML.

3.1 Large databases and initiatives

Materials genome initiatives and multi-institutional international efforts seeking to establish a generic platform for collaboration [38, 39, 40] are a hallmark of the importance of big data and ML in chemistry and materials sciences. Perhaps the major goal is to move beyond the trial-and-error empirical approaches prevailing in the past [35]. For example, the US Materials Genome Initiative (MGI) (https://www.mgi.gov/) established the following major challenges [41]:

  • Lead a culture shift in materials research to encourage and facilitate an integrated team approach;

  • Integrate experiment, computation, and theory and equip the materials community with advanced tools and techniques;

  • Make digital data accessible; and

  • Create a world-class materials workforce that is trained for careers in academia or industry.

To encourage a cultural change in materials research, ongoing efforts intend to generate data that can serve both to validate existing models and to create new, more sophisticated models with enhanced predictive capabilities. This has been achieved with a virtual high-throughput experimentation facility involving a national network of labs for synthesis and characterization [41] and partnerships between academia and industry for tackling specific applications. For example, the Center for Hierarchical Materials Design is developing databases for materials properties and materials simulation software [41], and alliances have been established to tackle topics such as Materials in Extreme Dynamic Environments and Multiscale Modeling of Electronic Materials [41]. There are also studies of composite materials to improve aircraft fuel efficiency and of metal processing to produce lighter weight products and vehicles [41]. Regarding the integration of experiment, theory, and computer simulation, perhaps the most illustrative example is an automated system designed to create a material, test it and evaluate the results, after which the best next experiment is chosen in an iterative procedure. The whole process was conducted without human intervention. The system is already in use to speed up the development process of high-performance carbon nanotubes for use in aircraft [41]. Another large-scale program at the University of California, Berkeley, used high-performance computing and state-of-the-art theoretical tools to produce a publicly available database of the properties of 66,000 new and predicted crystalline compounds and 500,000 nanoporous materials [41].

There are cases that require combining different levels of theory and modelling with experimental results, particularly for more complex materials. In the Nanoporous Materials Genome Center, microporous and mesoporous metal-organic frameworks and zeolites are studied for energy-relevant processes, catalysis, carbon capture, gas storage, and gas- and solution-phase separation. The theoretical and computational approaches range from electronic structure calculations combined with Monte Carlo sampling methods to graph theoretical analysis, which are assembled into a hierarchical screening workflow [41].

An important feature of MGI is the provision of infrastructure for researchers to report their data in a way that allows curation. Programs such as the Materials Data Repository (MDR) and Materials Resource Registry are being developed to allow for worldwide discovery, in some cases based on successful resources from other communities, such as the Virtual Astronomical Observatory’s registry [41]. This component of the “Make digital data accessible” goal of MGI has already provided extensive datasets, e.g., ~ 1,500 compounds to be used in electrodes for lithium-ion batteries and over 21,000 organic molecules for liquid electrolytes. These programs ensure immediate access by industry to data that may help to accelerate material development in applications such as hydrogen fuel cells, the pulp and paper industry, and solid-state lighting [41]. Additionally, the quantum chemistry database of the “PubChemQC” project (http://pubchemqc.riken.jp/) [42] contains the ground-state electronic structures for 3 million molecules and 10 low-lying excited states for more than 3.5 million molecules (e.g., “water”, “ethanol”, “ethyl alcohol”). For this database, the ground-state structures were calculated with density-functional theory (DFT) at the B3LYP/6-31G* level, while time-dependent DFT with the B3LYP functional and the 6-31+G* basis set was used for the excited states. The project also employs ML (SVMs and regression) for predicting DFT results related to the electronic structure of molecules.

3.2 Identification of compounds with genetic algorithms

In bioinspired computation, computer scientists define procedures that mimic mechanisms observed in natural settings. This is the case for genetic algorithms [43], inspired by Charles Darwin’s ideas of evolution, which mimic the “survival of the fittest” principle to set up an optimization procedure. In these algorithms, known functional compounds are crossed over, along with a mutation factor, to produce novel compounds; the mutation factor introduces new properties into the offspring. Novel compounds with no useful properties are disregarded, while those displaying useful properties (high fitness) are selected to produce new combinations. After a certain number of generations (or iterations), new functional compounds emerge with some properties inherited from their ancestors, supplemented with other properties acquired along their mutation pathway. Of course, this is an oversimplified description of the process, which depends on accurate modelling of the compounds, a proper definition of the mutation procedure, and a robust evaluation of the fitness property. The latter may be obtained by means of calculations, as in the case of conductivity or hardness, reducing the need for expensive experimentation.

In genetic algorithms, each compositional or structural characteristic of a compound is interpreted as a gene. Examples of chemical genes include the fraction of individual components in a given material, polymer block sizes, monomer compositions, and processing temperature. The genome refers to the set of all the genes in a compound, while the resulting properties of a genome are named a phenotype. The task of a genetic algorithm is to scan the search space of the gene domains to identify the most suitable phenotypes, as measured by a fitness function. The relationship between the genome and the phenotype gives rise to the fitness landscape (see Tibbetts et al. [2] for a detailed background). Figure 3 illustrates a fitness landscape for two hypothetical genes, say, the block size and processing temperature of a polymer synthesis process whose aim is to achieve high hardness. Note that when exploring a search space, the fitness landscape is not known in advance and is rarely only two-dimensional; instead, it is implicit in the problem model defined by the genes’ domains and in the definition of the fitness function. The modelling of the problem is adequate if the genes permit gainful movement over the fitness landscape and the fitness function correlates with the physical properties of interest. In the example in Fig. 3, the genetic algorithm moves along the landscape by producing new compounds while avoiding compounds that will not improve the fitness. The mechanism of the genetic algorithm, therefore, grants it a higher probability of moving towards phenotypes with the desired properties. For a comprehensive review focused on materials science, please refer to the work of Paszkowicz [44].

Fig. 3
figure 3

A fitness landscape considering two hypothetical genes as, for example, block size and processing temperature of polymer synthesis. Fitness could be hardness, for instance. The landscape contains local maxima, a global maximum, and a global minimum. The path of a genetic algorithm along 11 iterations is shown in red. Elaborated by the authors of this paper
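The toy genetic algorithm below optimizes the two hypothetical genes of Fig. 3 against a synthetic fitness function standing in for a measured or calculated property such as hardness; the bounds, mutation rate, and population sizes are arbitrary illustrative choices.

```python
# Toy genetic algorithm over two hypothetical "genes" (block size and
# processing temperature) with a synthetic fitness landscape.
import random

random.seed(0)
BOUNDS = {"block_size": (1.0, 100.0), "temperature": (20.0, 300.0)}

def fitness(genome):
    # Synthetic landscape with a single optimum; a real study would use
    # experiments or calculations (e.g., hardness, conductivity) here.
    b, t = genome["block_size"], genome["temperature"]
    return -((b - 60.0) ** 2) / 100.0 - ((t - 180.0) ** 2) / 500.0

def random_genome():
    return {g: random.uniform(*BOUNDS[g]) for g in BOUNDS}

def crossover(p1, p2):
    return {g: random.choice((p1[g], p2[g])) for g in BOUNDS}

def mutate(genome, rate=0.2):
    for g, (lo, hi) in BOUNDS.items():
        if random.random() < rate:
            genome[g] = min(hi, max(lo, genome[g] + random.gauss(0, 0.05 * (hi - lo))))
    return genome

population = [random_genome() for _ in range(30)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                       # "survival of the fittest"
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(20)]
    population = parents + offspring
best = max(population, key=fitness)
print("best genome:", best, "fitness:", fitness(best))
```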

The identification of more effective catalysts has also benefited from genetic algorithms, as in the work by Wolf et al. [45], where a set of oxides (B2O3, Fe2O3, GaO, La2O3, MgO, MnO2, MoO3, and V2O5) was taken as the initial population of an evolutionary process. Aimed at finding the catalysts that would optimize the conversion of propane into propene through dehydrogenation, the elements of the initial set were iteratively combined to produce four generations of catalysts. In total, the experiment produced 224 new catalysts with an increase of 9 % in conversion (T = 500 °C, C3H8/O2, p(C3H8) = 30 Pa). A thorough review of evolutionary methods in the search for more efficient catalysts was given by Le et al. [29]. Bulut et al. [46] explored polyimide solvent-resistant nanofiltration membranes prepared by phase inversion to produce membrane-like materials. The aim was to optimize the composition space given by two volatile solvents (tetrahydrofuran and dichloromethane) and four nonsolvent additives (water, 2-propanol, acetone, and 1-hexanol). This system was modeled as a genome with eight variables corresponding to a search space of 9 × 10²¹ possible combinations, which could not be exhaustively scanned regardless of the screening method. The solution was to employ a genetic algorithm driven by a fitness function defined by the membrane retention and permeance. Over four generations and 192 polymeric solutions, the fitness function indicated an asymptotic increase in membrane performance.

3.3 Synthesis prediction using ML

The synthesis of new compounds is a challenging task, especially in organic chemistry. The search for machine-based methods to predict which molecules will be produced from a given set of reactants and reagents started in 1969 with Corey and Wipke [47], who demonstrated that synthesis (and retrosynthesis) predictions could be carried out by a computing machine. Their approach was based on templates produced by expert chemists that defined how atom connectivity would rearrange, given a set of conditions–see Fig. 4. Despite demonstrating the concept, their approach suffered from limited template sets, which prevented it from encompassing a wide range of conditions and caused it to fail in the face of even the smallest alterations.

Fig. 4
figure 4

Example of a reaction and its corresponding reaction template. The reaction is centered in the green-highlighted areas (27,28), (7,27), and (8,27). The corresponding template includes the reaction center and nearby functional groups. Reproduced with permission from the work of Jin et al. [48]

The use of templates (or rules) to transfer knowledge from human experts to computers, as seen in the work of Corey and Wipke, corresponds to an old computer science paradigm broadly referred to as “expert systems” [49]. This approach attained limited success in the past due to the burden of producing sufficiently comprehensive sets of rules capable of yielding results over a broad range of conditions, coupled with the difficulty of anticipating exceptional situations. Nevertheless, it has gained renewed interest recently, as ML methods may contribute to automatic rule generation by taking advantage of large datasets, as is being explored, for instance, in medicine [50].

In association with big data, ML became an alternative whereby knowledge is extracted not only from experts but also from datasets. Coley et al. [51], for example, used a 15,000-patent dataset to train a neural network to identify the sequence of templates that would most likely produce a given organic compound during retrosynthesis. Segler and Waller [52] used 8.2 million binary reactions (including 14.4 million molecules) acquired from the Reaxys web-based chemistry database (https://www.reaxys.com) to build a knowledge graph, a bipartite directed graph G = (M, R, E) made of two sets of nodes, where M stands for the set of molecules and R for the set of reactions, plus one set E of labelled edges, each representing a role t ∈ {reactant, reagent, catalyst, solvent, product}. A schema of the approach is depicted in Fig. 5. A link-prediction ML algorithm was employed, which predicts new edges from the characteristics of existing paths within the graph structure given by edges of type reactant. In the example depicted in Fig. 5b, the reactant path between molecules 1 and 4 indicates a missing reaction node between them, i.e., node D in Fig. 5c. The experiments confirmed a high accuracy in predicting the products of binary reactions and in detecting reactions that are unlikely to occur.

Fig. 5
figure 5

Graph representation of reactions. a Four molecules, 1, 2, 3, and 4, and three reactions, A, B, and C. b The graph representation in which the circles represent the molecule nodes, and the diamonds represent the reaction nodes; edges describe the role, reactant, reagent, catalyst, solvent, or product of each molecule in a given reaction. Notice the path, depicted in orange color, made of reactant edges between molecules 1 and 4. c The missing reaction node D that was indicated by the reactant path found between 1 and 4. Reproduced with permission from the work of Segler and Waller [52]
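A minimal sketch of such a molecule-reaction graph, built with the networkx library on placeholder molecules and reactions (not the Reaxys data of [52]), is shown below; the actual link-prediction model in [52] is considerably more elaborate than the simple path search used here.

```python
# Sketch of the bipartite knowledge graph G = (M, R, E) of Fig. 5 using networkx:
# molecule nodes and reaction nodes connected by role-labelled edges.
import networkx as nx

G = nx.DiGraph()
# Molecule nodes (set M) and reaction nodes (set R)
G.add_nodes_from(["mol_1", "mol_2", "mol_3", "mol_4"], bipartite="molecule")
G.add_nodes_from(["rxn_A", "rxn_B", "rxn_C"], bipartite="reaction")
# Role-labelled edges (set E): reactant edges point into a reaction, product edges out
G.add_edge("mol_1", "rxn_A", role="reactant")
G.add_edge("rxn_A", "mol_2", role="product")
G.add_edge("mol_2", "rxn_B", role="reactant")
G.add_edge("rxn_B", "mol_3", role="product")
G.add_edge("mol_3", "rxn_C", role="reactant")
G.add_edge("rxn_C", "mol_4", role="product")

# A path through reaction nodes from mol_1 to mol_4 hints at a possible missing
# direct reaction between them (node "D" in Fig. 5c).
path = nx.shortest_path(G, "mol_1", "mol_4")
print("path suggesting a candidate reaction:", path)
```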

Owing to the limitations of ad hoc procedures relying on templates, Szymkuc et al. [28] advocated that, for chemical syntheses, the chemical rules from stereo- and regiochemistry may be coded with elements of quantum mechanics to allow ML methods to explore pathways of known reactions from a large database. This tends to increase the data space to be searched. In the recent literature, not restricted to the chemical field, DL appears to be the most promising approach for successfully exploring large search spaces [53, 54] and for reaching autonomous molecular design [55]. Schwaller et al. [56] used DL methods to predict outcomes of chemical reactions and found the approach suitable to assimilate latent patterns for generalizing beyond a pool of examples, even though no explicit rules were produced. Assuming that organic chemistry reactions display properties similar to those studied in linguistic theories [57], they explored state-of-the-art neural networks to translate reactants into products much as translation is performed from one language into another. In their work, a DNN was trained on Lowe’s dataset of US patents filed between 1976 and 2016 [58], including 1,808,938 reactions described using the SMILES [59] chemical language, which defines a notation system to represent molecular structures as graphs and strings amenable to computational processing. Jin’s dataset [48], a cleaned version of Lowe’s dataset with 479,035 reactions after removing duplicates and erroneous reactions, was also used. They achieved an accuracy of 65.4 % for single-product reactions over the entire Lowe’s dataset and an accuracy of 80.3 % over Jin’s dataset. Bombarelli et al. [60] also combined the SMILES notation and a DNN to map discrete molecules into a continuous multidimensional space in which the molecules are represented as vectors. In such a continuous space, it is possible to predict the properties of the existing vectors and to generate new vectors with certain properties. Gradients are computed to indicate where to look for vectors whose properties vary in the desired way, and optimization techniques can be employed to search for the best candidate molecules. After finding new vectors, a second neural network converts the continuous vector representation into a SMILES string that reveals a potential lead compound. DNNs were also employed to predict reactions with 97 % accuracy on a validation set of 1 million reactions, clearly showing superior performance to previous rule-based expert systems [61].
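As a small illustration of the "reactions as language" viewpoint, the sketch below tokenizes a reaction SMILES string at the character level and builds the integer vocabulary that a sequence-to-sequence model would consume; the reaction and the single-character tokenizer are simplifications, not the preprocessing actually used in [56].

```python
# Minimal sketch of treating reaction SMILES as "language": character-level
# tokenization and vocabulary building, the kind of preprocessing a
# sequence-to-sequence ("translation") model needs before mapping reactants to products.
reaction = "CC(=O)O.OCC>>CC(=O)OCC"        # esterification written in SMILES (illustrative)
reactants, products = reaction.split(">>")

def tokenize(smiles):
    """Split a SMILES string into single-character tokens (a simplification;
    multi-character atoms such as 'Cl' or 'Br' would need a smarter tokenizer)."""
    return list(smiles)

src_tokens = tokenize(reactants)
tgt_tokens = tokenize(products)
vocab = {tok: i for i, tok in enumerate(sorted(set(src_tokens + tgt_tokens)))}

# Integer sequences ready to be fed to an encoder-decoder model
src_ids = [vocab[t] for t in src_tokens]
tgt_ids = [vocab[t] for t in tgt_tokens]
print(vocab)
print(src_ids, "->", tgt_ids)
```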

With respect to inorganic chemistry, Ward et al. [62] survey models to predict the melting points of binary inorganic compounds; the formation enthalpy of crystalline compounds; the crystal structures likely to form at certain compositions; the band-gap energies of specific classes of crystals; and the mechanical properties of metal alloys. Nevertheless, according to Ward et al., there are no widely used machine learning models for band-gap energy or glass-forming ability, even though large-scale databases with the corresponding properties have been available for years.

3.4 Quantum chemistry

The high computational cost of quantum chemistry has been a limiting factor in exploring the virtual space of all possible molecules from a quantum perspective. This is the reason why researchers are increasingly resorting to ML approaches [63, 64]. Indeed, ML has been used to replace or supplement quantum mechanical calculations, for example to predict the input parameters for semiempirical QM calculations [65], to model electronic quantum transport [66], or to establish a correlation between molecular entropy and electron correlation [63]. ML can also be employed to overcome or minimize the limitations of ab initio methods [67] such as DFT, which are useful for determining chemical reactions, quantum interactions between atoms, and molecular and material properties, but are not suitable for treating large or complex systems [68].

The coupling of ANNs and ab initio methods is exemplified in the PROPhet project [68] (PROPerty Prophet) for establishing nonlinear mappings between a set of virtually any system property (including scalar, vector, and/or grid-based quantities) and any other property. PROPhet provides, among its functionalities, the ability to learn analytical potentials, nonlinear density functions, and other structure-property or property-property relationships, reducing the computational cost of determining material properties, in addition to assisting in the design and optimization of materials [68]. Quantum chemistry-oriented ML approaches have also been used to predict the sites of metabolism for cytochrome P450 with a descriptor scheme where a potential reaction site was identified by determining the steric and electronic environment of an atom and its location in the molecular structure [69].

ML algorithms can accelerate the determination of molecular structures via DFT, as described by Pereira et al. [70], who estimated HOMO and LUMO orbital energies using molecular descriptors based only on connectivity. Another aim was to develop new molecular descriptors, for which a database containing > 111,000 structures was employed in connection with various ML models. With random forest models, the mean absolute error (MAE) was smaller than 0.15 and 0.16 eV for the HOMO and LUMO orbitals, respectively [70]. The quality of estimations was considerably improved when the orbital energy calculated by the semiempirical PM7 method was included as an additional descriptor.
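The sketch below reproduces the general setting, not the actual descriptors or data of [70]: a scikit-learn random forest regressor maps synthetic "connectivity" descriptors to a synthetic HOMO energy and reports the mean absolute error on held-out molecules.

```python
# Sketch of regression of an orbital energy (e.g., HOMO) from molecular
# descriptors with a random forest; all data are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 10))                   # hypothetical connectivity-based descriptors
homo = -6.0 + 0.4 * X[:, 0] - 0.2 * X[:, 4] + 0.05 * rng.normal(size=2000)  # synthetic energies (eV)

X_tr, X_te, y_tr, y_te = train_test_split(X, homo, test_size=0.2, random_state=3)
model = RandomForestRegressor(n_estimators=300, random_state=3).fit(X_tr, y_tr)
print("MAE (eV):", mean_absolute_error(y_te, model.predict(X_te)))
# In [70], adding a cheap semiempirical (PM7) estimate as an extra descriptor
# further reduced the error; here that would simply be one more column in X.
```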

The prediction of crystal structures is among the most important applications of high-throughput experiments [71], which rely on ab initio calculations. DFT has been combined with ML to exploit interatomic potentials for searching and predicting carbon allotropes [72]. In this latter method, the input structural information comes from liquid and amorphous carbon only, with no prior information on crystalline phases. The method can be associated with any algorithm for structure prediction, and the results obtained using ANNs were orders of magnitude faster than with DFT [72].

With a high-throughput strategy, time-dependent DFT was employed to predict the electronic spectra of 20,000 small organic molecules, but the quality of these predictions was poor [73]. Significant improvement was attained with a specific ML ansatz employed to determine low-lying singlet-singlet vertical electronic spectra, with excitation energies reproduced within ± 0.1 eV for a training set of 10,000 molecules. Significantly, the prediction error decreased monotonically with the size of the training set [73]; this experiment opened the prospect of addressing the considerably more difficult problem of determining transition intensities. As a proof-of-principle exercise, accurate potential energy surfaces and vibrational levels for methyl chloride were obtained in which ab initio energies were required only for some nuclear configurations in a predefined grid [74]. ML using a self-correcting approach based on kernel ridge regression was employed to obtain the remaining energies, reducing the computational cost of the rovibrational spectra calculation by up to 90 %, since tens of thousands of nuclear configurations could be determined within seconds [74]. ANNs were trained to determine spin-state bond lengths and ordering in transition metal complexes, starting with descriptors obtained with empirical inputs for the relevant parameters [75]. Spin-state splittings of single-site transition metal complexes could be obtained within 3 kcal mol⁻¹, an accuracy comparable to that of DFT calculations. In addition to predicting structures validated with ab initio calculations, the approach is promising for screening transition metal complexes with properties tailored for specific applications.
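The following sketch conveys the kernel ridge regression idea with a one-dimensional stand-in for a potential energy surface: only a sparse subset of grid points is treated as "ab initio" data, and the model fills in the rest; the functional form, kernel, and hyperparameters are illustrative assumptions, not those of [74].

```python
# Sketch of filling a potential energy grid with kernel ridge regression:
# a Morse-like 1-D curve stands in for the true surface, and only every
# tenth grid point is assumed to have been computed ab initio.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

r = np.linspace(0.8, 4.0, 200).reshape(-1, 1)            # grid of bond lengths (angstrom)
energy = (1 - np.exp(-1.5 * (r[:, 0] - 1.2))) ** 2       # placeholder potential energy curve

idx = np.arange(0, 200, 10)                              # sparse "ab initio" subset
model = KernelRidge(kernel="rbf", alpha=1e-6, gamma=1.0).fit(r[idx], energy[idx])

predicted = model.predict(r)                             # cheap predictions for the full grid
print("max abs error on the grid:", np.max(np.abs(predicted - energy)))
```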

The performance of ML methods for applications in quantum chemistry has been assessed via contests, as is often done in computer science. An example is the Critical Assessment of Small Molecule Identification (CASMI) Contest (www.casmi-contest.org) [76], in which ML and chemistry-based approaches were found to be complementary. Fragmentation methods to identify small molecules have improved considerably and should improve further in the coming years with the integration of additional high-quality experimental training data [76].

According to Goh et al. [35], DNNs have been used in quantum chemistry so far to a more limited extent than they have in computational structural biology and computer-aided drug design, possibly because the extensive amounts of training data they require may not yet be available. Nevertheless, Goh et al. state that such methods will eventually be applied massively for quantum chemistry, owing to their observed superiority in comparison to traditional ML approaches – an opinion we entirely support. For example, DNNs applied to massive amounts of data could be combined with QM approaches to yield accurate QM results for a considerably larger number of compounds than is feasible today [63].

Though the use of DNNs in quantum chemistry may still be at an embryonic stage, it is possible to identify significant contributions. Tests have been made mainly with the calculation of atomization energies and other properties of organic molecules [35] using a subset of 7,000 compounds from a library of 10⁹ compounds, where the energies in the training set were obtained with the PBE0 hybrid functional (which combines Perdew-Burke-Ernzerhof (PBE) exchange with Hartree-Fock exchange) [77]. DNN models yielded superior performance compared to other ML approaches, since a DNN could successfully predict static polarizabilities, ionization potentials, and electron affinities, in addition to atomization energies of the organic molecules [78]. Significantly, the accuracy was similar to the error of the corresponding level of theory employed in the QM calculations used to generate the training set. Applying DNNs to the dataset of the Harvard Clean Energy Project to discover organic photovoltaic materials, Aspuru-Guzik et al. [79] predicted HOMO and LUMO energies and power conversion efficiencies for 200,000 compounds, with errors below 0.15 eV for the HOMO and LUMO energies. DL methods have also been exploited in predicting ground- and excited-state properties for thousands of organic molecules, where the accuracy for small molecules can even be superior to that of ab initio QM methods [78]. Recent advances in the use of machine learning and computational chemistry methods to study organic photovoltaics are discussed in other works [80, 81, 82].

3.5 Computer‐aided drug design

Drug design has relied heavily on computational methods in a number of ways, from calculations of quantum chemistry properties with ab initio approaches, as previously discussed, to screening processes in the high-throughput analysis of families of potential drug candidates. Huge amounts of data have been gathered over the last few decades with a range of experimental techniques, and these data may contain additional information on material properties. This is the case for mass spectrometry datasets, which may contain valuable hidden information on antibiotics and other drugs. In the October 2016 issue of Nature Chemical Biology [83], the use of big data concepts was highlighted in the discovery of bioactive peptidic natural products via a method referred to as DEREPLICATOR. This tool uses statistical analysis via Markov chain Monte Carlo to evaluate the match between spectra in the Global Natural Products Social infrastructure database (containing over one hundred million mass spectra) and those from known antibiotics. Crucial for the design of new drugs are their absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties, on which the pharmacokinetic profile depends [84]. Determining ADMET properties experimentally is not feasible when such a large number of drug candidates are to be screened; therefore, computational approaches are the only viable option. For example, prediction results generated from various QSAR models can be compared with experimentally measured ADMET properties from databases [85,86,87,88,89,90,91]. These models are limited in that they may not be suitable to explore novel drugs, which motivates the increasing interest in ML methods that can be trained to generate predictive models capable of discovering implicit patterns in new data and thus becoming more accurate [84]. Indeed, ML predictive models have been used to identify potential anti-SARS-CoV-2 drugs, particularly with viral proteins as targets [92], and to evaluate drug toxicity [93].

Pires and Blundell developed the approach named pkCSM (http://structure.bioc.cam.ac.uk/pkcsm), in which the ADMET properties of new drugs are predicted with graph-based structural signatures [84]. In pkCSM, graphs are constructed by representing atoms as nodes, while the edges are given by their covalent bonds. Additionally, the labels used to decorate the nodes and edges with physicochemical properties are essential, similar to the approach used in embedded networks [5]. The concept of structural signatures is associated with establishing a signature vector that represents the distance patterns extracted from the graphs [84]. The workflow of pkCSM is depicted in Fig. 6 and involves two sets of descriptors for input molecules: general molecular properties and the distance-based graph signature. The molecular properties include lipophilicity, molecular weight, surface area, toxicophore fingerprint, and the number of rotatable bonds.

Fig. 6
figure 6

The workflow of pkCSM is represented by the two main sources of information, namely, the calculated molecular properties and shortest paths, for an input molecule. With these pieces of information, the ML system is trained to predict ADMET properties. Reproduced with permission from the work of Pires et al. [84]
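A rough sketch of a distance-based graph signature, inspired by but not reproducing pkCSM, is shown below using networkx: atoms are nodes, bonds are edges, and the signature counts atom pairs at each topological distance up to a cutoff; the molecule and cutoff are arbitrary.

```python
# Sketch of a distance-based graph signature: atoms as nodes, covalent bonds as
# edges, and a feature vector counting atom pairs per shortest-path distance.
import networkx as nx
import numpy as np

mol = nx.Graph()
# A small ethanol-like molecule for illustration
atoms = {0: "C", 1: "C", 2: "O", 3: "H", 4: "H", 5: "H"}
mol.add_nodes_from((i, {"element": e}) for i, e in atoms.items())
mol.add_edges_from([(0, 1), (1, 2), (0, 3), (0, 4), (0, 5)])

# Count atom pairs at each topological distance up to a cutoff
cutoff = 4
lengths = dict(nx.all_pairs_shortest_path_length(mol))
signature = np.zeros(cutoff, dtype=int)
for i in mol.nodes:
    for j in mol.nodes:
        if i < j and lengths[i][j] <= cutoff:
            signature[lengths[i][j] - 1] += 1

# One feature per distance bin, to be concatenated with the molecular properties
print("distance signature:", signature)
```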

QSAR [94,95,96,97,98], as used in computer-aided drug design to predict the biological activity of a molecule, is ubiquitous in some of these applications. The inputs are typically the physicochemical properties of the molecule. The use of DL for QSAR is relatively recent, as typified in the Merck challenge [99], wherein activities for 15 drug targets were predicted and evaluated against a test set. In later work, this QSAR experiment was repeated with a dataset curated from PubChem containing over 100,000 data points, for which 3,764 molecular descriptors per molecule were used as DNN input features [100]. DNN models applied to the Tox21 challenge provided the highest performance [101], with 99 % of neurons in the first hidden layer displaying significant association with toxicophore features. Therefore, DNNs may be used to discover chemical knowledge by inspecting the hidden layers [101]. Virtual screening is also relevant to complement docking methods for drug design, as exemplified with DNNs to predict the activity of molecules in protein binding pockets [102]. In another example, Xu et al. [103] employed a dataset with 475 drug descriptions to train a DNN to predict whether a given drug may induce liver injury. They used their trained predictor on a second dataset with 198 drugs, of which 172 were accurately classified with respect to their liver toxicity. This type of predictor affords significant time and cost savings by rendering many experiments unnecessary. This practice in the QSAR domain is potentially useful for the nonbiological quantitative structure-property relationship (QSPR) [34], in which the goal is to predict physical properties starting from simpler physicochemical properties. Despite the existence of interesting works on the topic [75, 79], there is room for further research.

A highly relevant issue that can strongly benefit from novel procedures built on big data and classical ML methods is drug discovery for neglected diseases. Cheminformatics tools have been assembled into a web-based platform in the project More Medicines for Tuberculosis (MM4TB), funded by the European Union [104]. The project relies on classical ML methods (Bayesian modelling, SVMs, random forest, and bootstrapping) working collaboratively on data acquired from the screening of natural products and synthetic compounds against the microorganism Mycobacterium tuberculosis.

Self-organizing maps (SOMs) are a particular type of ANN that has proven useful in rational drug discovery, where they assist in predicting the activity of bioactive molecules and their binding to macromolecular targets [105]. Antimicrobial peptide (AmP) activity was predicted using an adaptive neural network model with the amino acid sequence as input data [106]. The algorithm iterated to optimize the network structure, in which the number of neurons in a layer and their connectivity were free variables. High charge density and low aggregation in solution were found to yield higher antimicrobial activity. In another example of antimicrobial activity prediction, ML was employed to determine the activity of 78 sequences of antimicrobial peptides generated through a linguistic model, which treats the amino acid sequences of natural AmPs as a formal language described by a regular grammar [107]. The system was not efficient in predicting the 38 shuffled sequences of the peptides, a failure attributed to their low specificity. The authors [107] concluded that complementary methods with high specificity are required to improve prediction performance. An overview of the use of ANNs for drug discovery, including DL methods, is given by Gawehn et al. [108].

To a lesser extent than for drug design, ML is also being employed for modelling drug-vehicle relationships, which are essential to minimize toxicity [109]. The authors employed ML on data from the National Institutes of Health (NIH) Developmental Therapeutics Program to build classification models and predict toxicity profiles for a drug. Specifically, they employed a random forest classifier to determine which drug carriers led to the least toxicity, with a prediction accuracy of 80 %. Since this method is generic and may be applied in wider contexts, we see great potential in its use in the near future owing to the increasing possibilities introduced by nanotech-based strategies for drug delivery. To realize such potential, important knowledge gaps related to nanomaterials, immune responses, and immunotherapy will need to be filled [110].

As in many other application areas beyond materials science or pharmaceutical research, distinct ML methods have been compared according to their performance in solving a common problem. It now seems that DL may perform better than other ML methods [111], especially in cases where large datasets have been compiled over the years. Ekins [111] listed a number of applications of DNNs in the pharmaceutical field, including the prediction of aqueous solubility of drugs, drug-induced liver injury, ADME and target activity, and cancer diagnosis. In more recent work, Korotcov and collaborators [112] showed that DL yielded superior results compared to SVMs and other ML approaches in predictive ability for drug discovery and ADME/Tox data sets. The results are presented for Chagas disease, tuberculosis, malaria, and bubonic plague.

DL has also succeeded in the problem of protein contact prediction [113]. In 2012, Lena et al. [114] surpassed the previously unattainable mark of 30 % accuracy for this problem. They used a recursive neural network trained on a 2,356-element dataset from the ASTRAL database [115], a big data compendium of protein sequences and relationships. They then tested their network on 364 protein folds, achieving an unprecedented 36 % accuracy, which brought new hope to this complex field.

4 Sensor‐based data production for computational intelligence

The term Internet of Things (IoT) was coined at the end of the 20th century to mean that any type of device could be connected to the Internet, thus enabling tasks and services to be executed remotely [116]. In other words, the functioning of a device, appliance, etc. could be monitored and/or controlled via the Internet. If (almost) any object can be connected, three immediate consequences can be identified: (i) sensing must be ubiquitous; (ii) huge volumes of data will be generated; (iii) systems will be required to process the data and make use of the network of connected “things” for specific purposes. There is a virtually endless list of possible services, ranging across traffic control, health monitoring, surveillance, precision agriculture, control of manufacturing processes, and weather monitoring. In an example of sensors and sensing networks for monitoring health and the environment with wearable electronics, Wang et al. [117] emphasized the need to develop new materials to meet the stringent requirements of IoT-related applications.

A comprehensive review of chemical sensing (or of the IoT) is certainly beyond the scope of this paper, and we shall, therefore, restrict ourselves to some illustrative examples of how chemical sensors are producing big data, to make the point that sensors and biosensors are key to providing the data needed to solve problems by means of ML. Indeed, big data and ML methods have been employed to analyse data from sensors and biosensors in computer-assisted diagnosis for medicine and for other areas where diagnosis relies on sensing devices, such as fault prediction in industrial settings.

4.1 ML in sensor applications

Materials Science is essential for IoT sensing and biosensing, as well as in intelligent systems, for a variety of reasons, including the development of new materials for building innovative chemical (and electrochemical) sensing technologies (see, for instance, the review paper by Oliveira Jr et al. [118]). In recent decades, increasingly complex chemical sensing has produced data volumes from a wide range of analytical techniques. There has been a tradition in chemistry – probably best represented by contributions in chemometrics – to employ statistics and computational methods to treat not only sensing data but also other types of analytical data. Electronic noses (e-noses) are an illustrative example of the use of ML methods in sensing and biosensing [119, 120]. Ucar et al. [121] built an android nose to recognize the odor of six fruits by means of sensing units made of metal-oxide semiconductors whose output was classified using the Kernel Extreme Learning Machines (KELMs) method. In another work [122], the authors introduced a framework for multiple e-noses with cross-domain discriminative subspace learning, a more robust architecture for a wider odor spectrum. Robust e-sensing is also present in the work of Tomazzoli et al. [123], who employed multiple classification techniques, such as partial least squares-discriminant analysis (PLS-DA), k-nearest neighbors (kNN), and decision trees, to distinguish between 73 samples of propolis collected over different seasons based on the UV-Vis spectra of hydroalcoholic extracts. The relevance of this study lies in establishing standards for the properties of propolis, a biomass produced by bees and widely employed as an antioxidant and antibiotic due to its amino acids, vitamins, and bioflavonoids. As with many natural products, propolis displays immense variability, including dependence on the season when it is collected, so that excellent quality control must be ensured for reliable practical use in medicine. Automated classification approaches represent, perhaps, the only possible way to attain low-cost quality control for natural products that are candidate materials for cosmetics and medicines. The quantification of extracellular vesicles and proteins, as biomarkers for various diseases, was achieved with a combination of impedance spectroscopy measurements and machine learning [124].
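
The classification step in the propolis study can be sketched as follows: whole UV-Vis spectra are treated as feature vectors and fed to standard classifiers. The synthetic spectra and seasonal labels below stand in for the hydroalcoholic-extract data of Ref. [123], and the kNN/decision-tree pipeline is only an approximation of the full PLS-DA workflow used there.

```python
# Sketch: season classification from whole UV-Vis spectra, each spectrum
# treated as one feature vector. The spectra below are synthetic.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
wavelengths = np.linspace(250, 700, 226)          # nm, hypothetical grid
n_per_season, seasons = 20, 4
X, y = [], []
for s in range(seasons):                          # one synthetic absorption band per season
    peak = 300 + 80 * s
    band = np.exp(-((wavelengths - peak) / 40.0) ** 2)
    X.append(band + 0.05 * rng.normal(size=(n_per_season, wavelengths.size)))
    y.append(np.full(n_per_season, s))
X, y = np.vstack(X), np.concatenate(y)

for name, clf in {"kNN (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
                  "decision tree": DecisionTreeClassifier(random_state=0)}.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```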

Disease monitoring and control are essential in agriculture, as is the case for orange plantations, particularly when diseases and deficiencies yield similar visual patterns. Marcassa and coworkers [125] used images obtained from fluorescence spectroscopy and employed SVMs and ANNs to distinguish between samples affected by Huanglongbing (HLB) disease and those under zinc deficiency stress. The ability to process large amounts of data and identify patterns allows sensing and classification tasks to be integrated into portable devices such as smartphones. Mutlu et al. [126], for instance, used colored-strip images corresponding to distinct pH values to train a least-squares SVM classifier, and the results indicated that the pH values were determined with high accuracy.
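
The colorimetric-strip idea can be sketched as follows: each strip image is reduced to its mean RGB value and mapped to a pH class with a support vector classifier. A plain SVC stands in here for the least-squares SVM used in [126], and the images and labels are synthetic placeholders.

```python
# Sketch: predicting pH classes from colorimetric test-strip images.
# Each image is reduced to its mean RGB value; a standard SVC stands in
# for the least-squares SVM of Ref. [126]. Images/labels are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def mean_rgb(image):
    """Average colour of a strip image with shape (H, W, 3)."""
    return image.reshape(-1, 3).mean(axis=0)

rng = np.random.default_rng(4)
# Placeholder: 60 synthetic strip images whose colour shifts with pH class 0..5.
images = [np.clip(rng.normal(loc=[40 * c, 120, 255 - 40 * c], scale=10,
                             size=(32, 32, 3)), 0, 255) for c in range(6) for _ in range(10)]
labels = np.repeat(np.arange(6), 10)

X = np.array([mean_rgb(img) for img in images])
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print("cross-validated accuracy:", cross_val_score(model, X, labels, cv=5).mean())
```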

The visualization of bio(sensing) data to gain insight, support decisions or, simply, acquire a deeper understanding of the underlying chemical reactions has been exploited extensively by several research groups, as reviewed in the papers by Paulovich et al. [127] and Oliveira et al. [128]. Previous results achieved by applying data visualization techniques to different types of problems point to potentially valuable traits when chemical data are inspected from a graphical perspective. Possible advantages of this approach include:

(i) The whole range of features describing a given dataset of sensing experiments can be used as the input to multidimensional projection techniques, without discarding at an early stage information that might otherwise be relevant for a future classification (a minimal projection sketch is given after this list). For example, in electrochemical sensors, rather than using information about oxidation/reduction peaks, entire voltammograms may be considered; in impedance spectroscopy, instead of taking the impedance value at a given frequency, the whole spectrum can be processed to obtain a visualization.

(ii) Other multidimensional visualization techniques, such as parallel coordinates [129], allow identification of the features that contribute most significantly to the distinguishing ability of the bio(sensor).

(iii) Various multidimensional projection techniques are available, including nonlinear models, which in some cases have proven efficient for handling biosensing data [127]. One such case involved impedance-based immunosensors employed to detect the pancreatic cancer biomarker CA19-9 [130].
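
As a concrete illustration of point (i), the sketch below projects whole impedance spectra (real and imaginary parts concatenated) onto two dimensions with PCA; a nonlinear projection such as scikit-learn's TSNE could be swapped in. The spectra are synthetic placeholders, not the CA19-9 immunosensor data of [130].

```python
# Sketch: multidimensional projection of entire impedance spectra.
# Each sample is the concatenation of Re(Z) and Im(Z) over all measured
# frequencies; PCA reduces this to 2-D coordinates for visual inspection.
# Replace PCA with sklearn.manifold.TSNE for a nonlinear projection.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n_freq = 50                                    # points per spectrum (hypothetical)
def spectrum(shift):                           # toy Re(Z) + Im(Z) response
    f = np.logspace(0, 5, n_freq)
    re = 1.0 / (1.0 + (f / (1e3 * shift)) ** 2)
    im = -(f / (1e3 * shift)) / (1.0 + (f / (1e3 * shift)) ** 2)
    return np.concatenate([re, im])

# Two "analyte concentrations", 15 replicate spectra each, with noise.
X = np.vstack([spectrum(s) + 0.02 * rng.normal(size=2 * n_freq)
               for s in (1.0, 3.0) for _ in range(15)])

coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print("2-D coordinates of the first three spectra:\n", coords[:3])
```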

The feature selection mechanism used by Thapa et al. [130], performed via manual visual inspection combined with the silhouette coefficient (a measure of cluster quality), was demonstrated to enhance the immunosensor performance. More sophisticated approaches can also be employed, as in the work by Moraes et al. [131], in which a genetic algorithm was applied to inspect the real and imaginary parts of the electrical impedance measured by two sensing units; the method was capable of distinguishing triglycerides and glucose by means of well-characterized visual patterns. Using predictive modelling with decision trees, Aileni [132] introduced a system named VitalMon, designed to identify correlations between parameters from biomedical sensors and health conditions. An important tenet of the design was the fusion of data from different sources, e.g., a wireless network, and of sensed data related to distinct parameters, such as breath, moisture, temperature, and pulse. Within this same approach, data visualization was combined with ML methods [133] for the diagnosis of ovarian cancer using input data from mass spectrometry.
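
The silhouette coefficient mentioned above can be computed directly with scikit-learn. The sketch below scores two candidate feature subsets by how well the known sample classes separate in each subspace (higher is better); the data and subsets are illustrative, not those of Refs. [130, 131].

```python
# Sketch: using the silhouette coefficient to compare candidate feature
# subsets. The known class labels play the role of cluster assignments;
# a higher silhouette means better-separated classes in that subspace.
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n = 60
y = np.repeat([0, 1], n // 2)
# Features 0-1 carry class information; features 2-5 are pure noise.
X = np.column_stack([y + 0.3 * rng.normal(size=n),
                     2 * y + 0.3 * rng.normal(size=n),
                     rng.normal(size=(n, 4))])
X = StandardScaler().fit_transform(X)

for name, cols in {"informative subset": [0, 1], "noisy subset": [2, 3, 4, 5]}.items():
    print(name, silhouette_score(X[:, cols], y))
```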

This type of analytical approach is key for electronic tongues (e-tongues) and e-noses, as these devices take the form of arrays of sensing units and generate multivariate data [134]. For example, data visualization and feature selection were combined to process data from a microfluidic e-tongue to distinguish between gluten-free and gluten-containing foodstuffs [135]. ML methods can also be employed to teach an e-tongue whether a taste is good or not, according to human perception. This has been done for the capacitance data of an e-tongue applied to Brazilian coffee samples, as explained by Ferreira et al. [136]. In that paper, the technique yielding the highest performance was referred to as an ensemble feature selection process based on the random subspace method (RSM). The suitability of this method for predicting coffee quality scores from the impedance data obtained with an e-tongue was supported by the high correlation between the predicted scores and those assigned by a panel of human experts.
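
A random subspace ensemble of the kind referred to above can be approximated with scikit-learn's BaggingRegressor by drawing random feature subsets instead of bootstrap samples. The capacitance data and expert scores below are placeholders, and this is not the exact ensemble feature selection procedure of [136].

```python
# Sketch: random subspace regression for predicting sensory quality scores
# from e-tongue capacitance data. Each base regressor sees a random subset
# of the features (bootstrap=False keeps all samples), approximating the
# random subspace method. Data and scores are placeholders.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
n_samples, n_features = 80, 120                 # placeholder e-tongue dataset
X = rng.normal(size=(n_samples, n_features))
quality = 70 + 5 * X[:, :10].mean(axis=1) + rng.normal(scale=0.5, size=n_samples)  # expert scores

ensemble = BaggingRegressor(Ridge(alpha=1.0), n_estimators=100,
                            max_features=0.3, bootstrap=False, random_state=0)
predicted = cross_val_predict(ensemble, X, quality, cv=5)
print("correlation with expert scores:", np.corrcoef(predicted, quality)[0, 1])
```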

4.2 Providing data for big data and ML applications with chem/biosensor networks

As discussed, sensors and biosensors are crucial to provide information at the core of big data and ML. Large-scale deployments of essentially self-sustaining wireless sensor networks (WSNs) for personal health and environment monitoring, whose data can be mined to offer a comprehensive overview of a person’s or ecosystem’s status, were anticipated long ago [137]. In this vision, large numbers of distributed sensors continuously collect data that are further aggregated, analysed, and correlated to report upon real-time changes in the quality of our environment or an individual’s health. At present, deployments of chemical WSNs are limited in scale, and most of the sensors employed rely on the modulation of physical properties, such as temperature, pressure, conductivity, salinity, light illumination, moisture, or movement/vibration, rather than chemical measurements. In environmental monitoring, there are examples of relatively large-scale deployments that encompass forest surveillance (e.g., GreenOrbs WSN with approximately 5,000 sensors connected to the same base station [138]), vineyard monitoring [139,140,141,142], volcanic activity monitoring [143], greenhouse monitoring [144, 145], soil moisture monitoring [146], water status monitoring [147], animal migration [148, 149] and marine environment monitoring [150, 151], among others [152].

In the wireless body sensor network (WBSN) arena, with over two-thirds of the world's population already connected by mobile devices [153], the potential impact of WSNs and IoT on human performance, health and lifestyle is enormous. While numerous wearable technologies specific to fitness, physical activity and diet are available, studies indicate that devices that monitor and provide feedback on physical activity may not offer any advantage over standard approaches [154]. These studies suggest that ML approaches may be required to generate a meaningful and effective improvement in an individual’s lifestyle. However, physical sensors offer only a limited perspective of the environmental status or individual’s condition. A much fuller picture requires more specific molecular information, an arena where WSNs based on chemical and biochemical sensors are essential to bringing the IoT to the next level of impact. In contrast to physical transducers such as thermistors, photodetectors, and movement sensors, chemical and biochemical sensors rely on intimate contact with the sample (e.g., blood, sweat or tears in the case of WBSN or water or soil in the case of environment-based sensors). These classical chemical sensors and biosensors follow a generic measurement scheme, in which a prefunctionalized surface presents receptor sites that selectively bind a species of interest in a sample.

Since the early breakthroughs in the 1960s and 1970s, which led to the development of a plethora of electrochemical and optochemical diagnostic devices, the vision of reliable and affordable sensors capable of functioning autonomously over extensive periods (years) to provide access to continuous streams of real-time data remains unrealized. This is despite significant investment in research and the many thousands of papers published in the literature. For example, it has been over 40 years since the concept of an artificial pancreas was proposed by combining the glucose electrode with an insulin pump [155]. Even now, there is no chemical sensor/biosensor that can function reliably inside the body for longer than a few days. The root problem remains the impact of biofouling and other processes that rapidly change the response characteristics of the sensor, leading to drift and sensitivity loss. Accordingly, in the past decade, scientists have begun to target more accessible media via less invasive means. This is in alignment with the exponential growth of the wearables market, which increasingly seeks to move beyond the current physical parameters and bring reliable chemical sensing to the wrists of over 3 billion wearers by 2025 [156].

Likewise, a number of low-cost devices to access molecular information via the analysis of sweat, saliva, interstitial fluid, and ocular fluid have been proposed. At their core, these bodily fluids contain relatively high concentrations of electrolytes, such as sodium, potassium and ammonium salts, in addition to biologically relevant small molecules, such as glucose, lactate, and pyruvate. While the relative concentrations of these compounds in alternative bodily fluids deviate from those found in blood, they offer an accessible means to a wide range of clinically relevant data, which can be collated and analysed to offer wearer-specific models. Several groups have made significant progress towards the realisation of practical platforms for the quantification of electrolytes in sweat in recent years. By integrating ion-selective electrodes with a wearable system capable of harvesting and transporting sweat, watch-type devices have shown an impressive ability to harvest sweat and track specific electrolytes in real time [157], as illustrated in Fig. 7. Similarly, a wearable electrochemical sensor array was developed by Javey et al. [158]. The resulting fully integrated system, capable of real-time detection of sodium, potassium, glucose, and lactate, is worn as a band on the forehead or arm, transmitting the data to a remote base station.

Fig. 7

SwEatch: watch-sensing platform for sodium analysis in sweat. 1: sweat harvesting device in 3D-printed platform base, 2: fluidic sensing chip, 3: electronic data logger and battery, and 4: 3D-printed upper casing. Reproduced with permission from the work of Glennon et al. [157]

Contact lenses provide another means to access a wide range of molecular analytes in a noninvasive manner through information-rich ocular fluid. Pioneered by Badugu et al. [159] nearly 20 years ago, a smart contact lens could monitor ocular glucose through fluorescence changes. The initial design was further developed to encompass ions such as calcium, sodium, magnesium, and potassium [160]. While the capability of such a device is self-evident, such a restrictive sensing mode may ultimately hamper its application. It took several years for significant inroads into flexible electronics and wireless power transfer to enable a marriage of electrochemical sensing with a conformable contact lens. Demonstrated by Park et al. [161] for real-time quantification of glucose in ocular fluid, this platform indicates the potential of combining a reliable, accurate chemical sensing method with integrated power and electronics in a noninvasive approach to access important clinical data.

Although considerably more invasive than sweat or ocular fluid sensing, accurate determination of biomarkers in interstitial fluids has a proven track record. Indeed, the first FDA-approved means for noninvasive glucose monitoring, namely, the GlucoWatch [162], relied on interstitial fluid sampled using reverse iontophoresis with subsequent electrochemical detection. This pioneering development from 2001 offered multiple measurements per hour and provided its wearer with an easy-to-use watch-like interface. Although ultimately hampered by skin irritation and calibration issues, it nonetheless signified a milestone in minimally invasive glucose measurements. Continuing with this approach, in 2017, the FDA approved the Abbott FreeStyle Libre device, which enables the wireless monitoring of blood sugar via analysis of the interstitial fluid. On application, the device punctures the skin and places a 0.5 cm fibre wick through the outer skin barrier so that interstitial fluid can be sampled and monitored for glucose in real time for up to two weeks, at which point the device is replaced. Data are accessed using a wireless mobile phone-like base station.

In addition to delivering acceptable analytical performance in the relevant sample media, there are a number of challenges for on-body sensors related to size, rigidity, power, communication, data acquisition, processing, and security [163], which must be overcome before they can realize their full potential and play a pivotal role in applications of the IoT in healthcare services.

A similar scenario is faced in the environmental arena, in that (bio)chemical sensing is inherently more expensive and complex than monitoring physical parameters such as temperature, light, depth, or movement. This is strikingly illustrated by the Argo Project, which currently has a total of ca. 3,000–4,000 sensorized ‘floats’ distributed globally in the oceans, all of which track location, depth, temperature, and salinity. These were originally devised to monitor several core parameters (temperature, pressure, and salinity) and share the data from this global sensor network via satellite communications links to provide an accurate in situ picture of the ocean status in real or near real time. Temperature data are accurate to ± 0.002 °C, and uncorrected salinity is accurate to ± 0.1 psu (which can be improved by relatively complex and time-consuming postacquisition processing). Interestingly, an increasing number of floats now include ‘Biogeochemical’ sensors (308, ca. 10 %, April 2018) for nitrate (121), chlorophyll (186), oxygen (302) and pH (97); see http://www.argo.ucsd.edu and the maps in Fig. 8. Of these, nitrate is measured by direct UV absorbance, and chlorophyll is measured by absorbance/fluorescence at spectral regions characteristic of algal chlorophyll. These are optical measurements rather than conventional chemical sensor measurements. Moreover, it is likely that most, if not all, of the pH measurements are performed using optically responsive dyes rather than the well-known glass electrode. This strikingly demonstrates that chemical sensors are avoided when long-term, reliable and accurate measurements are required from remote locations and hostile environments. It is also striking that more complex chemical and biological measurements (i.e., those that require analysers incorporating reagents, microfluidics, etc.) are not included in the Argo project. Such autonomous analyser platforms, for tracking key parameters such as nutrients, dissolved oxygen, pH, heavy metals, and organics, typically cost €15 K or more per unit to buy, not including service and consumable charges. For example, the Seabird Electronics dissolved oxygen sensors used in the Argo project cost $60 K each [164], and Microlab autonomous environmental phosphate analysers cost ca. €20 K per unit; i.e., they are far too expensive to use as basic building blocks of larger-scale deployments for IoT applications.

Fig. 8

Distribution of sensorised floats monitoring the status of a variety of ocean parameters that encompass core parameters (temperature, velocity/pressure, salinity) and biogeochemical sensing (e.g., oxygen, nitrate, chlorophyll, pH). Reproduced with permission from the Argo documentation [165]

Progress towards realizing disruptive improvements is encouraged by competitions organised by environmental agencies such as the Alliance for Coastal Technologies (http://www.act-us.info/nutrients-challenge/index.php), which launched the ‘Global Nutrient Sensor Challenge’ in 2015. The purpose was to stimulate innovation in the sector, as participants had to deliver nutrient analysers capable of 3 months of independent in situ operation at a maximum unit cost of $5,000. The ACT estimated the market for these devices at ca. 30,000 units in the USA and ca. 100,000 units globally (i.e., ca. $500 million per year). This market is set to expand further as new applications related to nutrient recovery grow in importance (e.g., from biodigester units and wastewater treatment plants), driven by the need to meet regulatory targets and business opportunities linked to the rapidly increasing cost of nutrients.

4.3 Prospects for scalable applications of chemical sensors and biosensors

From the discussion so far, it is clear that there are substantial markets in personal health and environmental monitoring and other sectors for reliable chemical sensors and biosensors that are fit for purpose with an affordable use model. While progress has been painfully slow over the past 30–40 years, since the excitement of early promising breakthroughs [166], the beginnings of larger-scale use and a tentative move from single-use or centralised facilities towards real-time continuous measurements at point of need are now apparent. As this trend develops, the range and volume of data collected will rise exponentially, and new types of services will emerge, most likely borrowing ideas and models from existing applications, such as the myriad of products for personal exercise tracking, merged with new tools designed to deal with the more complex behaviour of molecular sensors.

In the health sector, this will support an expansion in the rollout of remote services due to the increasing availability of wearable/implantable diagnostic and autonomous drug delivery platforms that operate in a closed-loop control mode with real-time tracking of key biomarkers [167]. Of course, these services will be linked to an overarching personalized health informatics framework that enables healthcare professionals to monitor individual status remotely and triggers escalations in response if thresholds are breached or a future issue is predicted from data trends. Machine learning has already played a role in the diagnosis of SARS-CoV-2 responsible for the COVID-19 pandemic [168].

Likewise, in the environmental sector, unit costs for autonomous chemical analysers for water monitoring remain stubbornly high, constituting a significant barrier to scale-up, particularly when coupled with a high cost of ownership due to frequent service intervals. This is a frustrating situation, as the tremendous benefits of long-term autonomous sensing of key status indicators in the health and water sectors are clear. For example, low-cost, reliable water quality analysers would revolutionise the way drinking, waste, and natural waters are monitored. Combining in situ, real-time water sensing with satellite/flyover remote sensing represents an immensely powerful development due to the tremendous enhancement of the integrated information content when the global scale of satellite sensing is coupled with the detailed molecular information generated by in situ deployed sensors and analysers [169]. The scale of data generation from satellite remote sensing is already staggering, having reached 21.1 petabytes (PB) by 2015 and continuing to grow exponentially [170]. As Kathryn Sullivan, NOAA Administrator and Under Secretary of Commerce for Oceans and Atmosphere, and former NASA astronaut, commented recently, “NOAA observations alone provide some 20 terabytes every day—twice the data of the Library of Congress’ entire print collection”. She also commented on the importance of developing new tools and collaborations to realise the value of the encoded information: “Just 20 years ago, we were piecing together data points by hand. Five years ago, 90 % of today’s data had yet to be generated. Now we’re innovating in the cloud, experiencing Earth with a wider lens and in fresh new ways. NOAA (National Oceanic and Atmospheric Administration) is partnering with Amazon, Microsoft, IBM, Google, and the Open Commons Consortium to tap that potential.” The global coverage acquired with increasingly fine spatial resolution, multiple spectral band sensors, and increasing numbers of satellites is driving this tremendous increase in the scale of data production. However, the benefits are already being realized, for example, through studies that couple highly localised data with large-scale coverage: ecological models have been tested and used to visualize stratification dynamics in 2,368 lakes [171] and daily temperature profiles for almost 11,000 lakes [172]. In the latter case, the models have also been used to predict future water temperatures. In the near future, we are certain to see larger-scale in situ sensing networks as new technologies emerge that combine lower unit cost with much longer service intervals. New data analysis and visualization tools will be required to enable these highly complementary information sources to be combined to maximize the effect.

Similarly, real-time tracking of disease markers could be coupled with smart drug delivery platforms. However, significant fundamental barriers still exist and must be surmounted if this revolution in sensing is to be realised. The most formidable is how to maintain and validate system performance during extended use (typically a minimum of 3 months for water monitoring and at least several years, preferably 10 years or more, for implantables) [173]. In situ performance validation involves calibration, requiring fluidics and the storage of standards/reagents, and all components and solutions must function reliably for the entire service interval. Progress in water monitoring systems is easier because the service interval is shorter and the systems are more accessible, with a larger footprint. For implantables, however, the challenges are daunting, and in recent years, researchers have therefore focused on noninvasive/minimally invasive on-body use models as described above [158].

If devices capable of meeting the challenge of long-term reliable chemical/biological sensing could be realised, this would represent a keystone breakthrough upon which multiple applications with a revolutionary impact on society could be built. It would require informatics systems and tools to process and filter the data, recognise patterns and events, and communicate with related data at personal, group and, ultimately, societal levels, from a single location to the global scale. Perhaps when this happens, we will witness at last the emergence of true internet-scale sensing and control via chemical sensors and biosensors, effectively creating a continuum between the molecular and digital worlds [174].

5 Concluding remarks: limitations and future prospects

Throughout this review, we adopted an optimistic tone regarding the recent advances and prospects for the use of big data and ML in materials science and related applications. We believe such optimism is justified by the proliferation of projects – in academia and industry – dedicated to developing artificial intelligence applications in several fields, including materials science. We take the view that, with massive investment and with so much at stake in the economy of corporations and countries, this ongoing AI-based revolution is unlikely to stop, with many positive prospects for the near future focused on materials science. Before discussing them, let us concentrate on the limitations and challenges to be faced by chemists and their collaborators in the short- and long-term futures of this revolution.

5.1 The state of the art

In describing the potential and effective use of big data and ML in Sect. 3, we did not conduct a critical analysis of the examples from the literature. With a few exceptions, we restricted ourselves to highlighting the potential usefulness of these methodologies for several areas of materials science, with emphasis on materials discovery. We considered that identifying limitations (or even deficiencies) in the approaches adopted in specific cases has limited value in view of the underlying conceptual difficulties to be faced in paving the way for more substantial advances in materials science supported by big data and ML.

In particular, in the Introduction to this paper, when considering the two types of goals in applying ML, we emphasized that one of them is much harder to reach. According to Wallach [6], the two categories of goals are related to “prediction” and “explanation”. In the prediction category, observed data are employed to reason about unseen data or missing information. For the explanation category, the aim is to find plausible explanations for observed data, addressing “why” and “how” questions. As Wallach puts it, “models for prediction are often intended to replace human interpretation or reasoning, whereas models for explanation are intended to inform or guide human reasoning”. Regardless of all the recent results attained with ML, as we may recapitulate from Sect. 3, the examples found in chemistry are not necessarily useful to guide human reasoning. Despite the quality and usefulness of the results obtained in various studies conducted for “prediction” purposes, the application of ML to generalize and bring entirely new knowledge to materials science is still to be realized, as confirmed from a quick inspection of the examples discussed in this paper and in a recent review paper [175].

Furthermore, even in the state of the art of AI and ML, there are no clear hints as to whether and how uses will be possible within the “explanation” category, not only for materials science but also for other areas. These limitations have been discussed in a proposal to employ AI and uncertainty quantification to obtain correctable models [176] and in identifying domains where ML is applicable more efficiently [177]. One may speculate that the answer may result from the convergence of the two big movements mentioned in the Introduction – big data and natural language processing – but the specifics of the solutions are far from established. It is likely, for instance, that a theoretical framework is required to guide experimental design for data collection and to provide conceptual knowledge and effective predictive models [132]. On the other hand, current limitations mean that the full potential of ML in materials science has yet to be realized.

5.2 Pitfalls

Critical issues also stem from the combination of ML with big data; that is, to what extent is having a large amount of data to learn from an actual benefit? The overall principle common to big data and ML is that of abandoning physical-chemical simulations that might be infeasible due to the demands of a molecule-to-molecule relationship computation. The workaround is to use formulations and/or high-level material properties to feed ML techniques and teach them to identify patterns; in summary, a trained ML algorithm works by mapping known physical properties onto unknown physical properties. From this perspective, ML may be taken as a curve-fitting technique capable of exploring search spaces far larger than those that could be handled by noncomputational approaches. This course of action has genuine potential for predicting useful chemical compounds, but some ML techniques, notably DL methods, are not readily interpretable with respect to their mechanism of action. This fact raises two concerns. First, a chemist learns little or nothing about what caused a given set of outputs, which does not contribute to advancing the field as a principled science. Second, unlike computer vision or speech recognition problems, in which the outputs are directly verifiable, the outputs of ML in materials science may be biased by a limited training set or even totally distorted by an ill-defined model. Nevertheless, the chemist might take them for granted or, otherwise, take a long time to actually identify the flaws. Again, such problems are not exclusive to the interplay between materials science and ML; in fact, they raise growing concerns and motivate ongoing investigations on how to enhance the interpretability of the representations created by these algorithms [178].
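
The concern about outputs that are not directly verifiable can be made concrete with a toy experiment: a flexible regressor fitted on a narrow, biased region of descriptor space returns confident but wrong predictions outside it, and nothing in the prediction itself signals the failure. The “property” and “descriptor” below are entirely synthetic.

```python
# Toy illustration of the pitfall: a model trained on a biased slice of
# descriptor space extrapolates poorly, yet produces numbers that look
# just as plausible as its in-domain predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
def true_property(x):                                   # unknown ground truth
    return np.sin(3 * x) + 0.5 * x

x_train = rng.uniform(0.0, 1.0, size=300)               # biased sampling: only [0, 1]
y_train = true_property(x_train) + 0.05 * rng.normal(size=300)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(x_train[:, None], y_train)

for x in (0.5, 1.5, 2.5):                               # inside vs outside the training domain
    print(f"x = {x}: predicted {model.predict([[x]])[0]:+.2f}, true {true_property(x):+.2f}")
```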

Other limitations stem from practical issues connected with big data. Such methodologies do not readily apply to a variety of problems in materials science, where the need to collect and curate high-quality data and the required computational infrastructure may pose major challenges. Critical problems have been mentioned in the discussion of sensors and biosensors in Sect. 4. To work efficiently, most learning algorithms require a reasonably large set of known examples. While the cost of collecting, e.g., thousands of images and/or millions of comments in a social network is nearly negligible, experiments in Materials Science might demand reagents, enzymes, and compounds, combined with significant hands-on labour following experimental protocols, and time to observe and annotate the results. As a possible consequence, computational techniques may lack the minimum amount of data necessary to produce trustworthy models. Rather than saving time and resources, the reverse may happen, leading to increased costs due to unwittingly applied erroneous procedures.

5.3 Challenges and prospects

In an additional challenge, many societal issues must be addressed to ensure the proper use of big data, including important ethical and privacy preservation questions. Take, for instance, the case of computer-assisted clinical diagnosis, which relies heavily on data from sensors and biosensors. As pointed out by Rodrigues-Jr et al. [50], an important obstacle to the thorough integration of databases is not related to the lack of technology but, rather, to commitments from individuals and institutions to work together in establishing acceptable procedures for data curation and privacy preservation and to avoid abusive or inadequate usage of medical data. The same holds in Materials Science.

In contrast to our approach regarding the uses of big data and ML in materials science, in reviewing the importance of sensors and biosensors to developing IoT and similar applications, we provided a critical analysis of the advances and limitations in the field. The prospects for the future depend on behavioral issues as well as scientific challenges. We hope that the acquired popularity of big data and ML may raise the awareness of researchers and developers on the importance of how they treat, preserve and analyze their data. We advocate that ML and other computational tools should already be in routine use not only by those working on sensors and biosensors but also in topics not traditionally considered sensing. The latter include various types of spectroscopy and imaging, which generate massive amounts of data. We believe there is a gap between the wide range of techniques available for data management and analysis and their actual use in the daily practice of many research and development facilities. This may be due to a combination of factors, such as a slow pace of dissemination of novel techniques, lack of a theoretical framework to guide the choice of techniques, and limited availability of accessible and usable implementations. Regardless of the reasons, increased awareness of this issue and more informed use of existing methodologies is an important step to reinforce progress.

As for the prospects for big data and ML in Materials Science in the next few years, it is self-evident that considerable advances can be attained by extending research efforts on predicting material properties, database-supported material design, identification of suitable compounds with genetic algorithms, synthesis prediction using DL, computer-aided drug design, and the determination of density-functional properties using machine learning algorithms in place of explicit calculations. One may, for instance, envisage searching for drugs and drug targets by harnessing the whole body of medical literature as a complement to the almost entirely chemistry-oriented approaches discussed above. An example of such a system is already being tested with a version of the Watson supercomputer [179], which deals with text in scientific papers and patents, in addition to considering pharmacological, chemical, and genomics data. The rationale behind the Watson approach is to establish connections across millions of text pages in a huge database of molecular properties. A similar approach was used to process toxicity data to predict the activity of chemical compounds [180]. Cases of enhanced prediction with clinical chemistry data were illustrated by Richardson et al. [9], who employed ML methods with large datasets for diseases such as hepatitis B and C. Nonetheless, the major goal of reaching truly intelligent systems to solve problems in chemistry beyond those of the classification type will require the convergence of big data and ML within schemes that allow one to interpret and explain the results. This is a major challenge not only for materials science but also for any area of science and technology.