Introduction

Material science has developed rapidly in the twenty-first century, both theoretically and experimentally, as seen in the development of gas conversion catalytic materials, the discovery of energy harvesting and storage materials, and the design of information functional materials1,2,3. As an interdisciplinary subject spanning material science and computer science, computational material science has become increasingly powerful owing to significant improvements in computing devices, and now serves as a bridge between theoretical prediction and experimental research3,4,5. Computational material science not only frees theoretical work from the constraints of analytical derivation, but also fundamentally reforms experimental research methods, helping researchers to reveal and confirm objective laws from experimental phenomena. Currently, modern material-simulation toolkits (e.g., Vienna Ab Initio Simulation Package (VASP)6, Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)7, Quantum Espresso8, crystal structure analysis by particle swarm optimization (CALYPSO)9,10, nonadiabatic molecular dynamics (Hefei-NAMD)11, defect and dopant ab-initio simulation package (DASP)12, and the user-friendly VASPKIT13) have brought computational material science to the masses in the form of practical tools, enabling experimentalists with little or no theoretical training to perform first-principles calculations (e.g., density functional theory (DFT) calculations14,15). Consequently, high-throughput calculation (HTC) has become a routine approach and has accelerated the development of databases of materials (organic and inorganic crystals, single molecules, and metal alloys) and properties (band gaps, formation energies, ionic conductivities, and elastic moduli16,17,18).
The Materials Genome Initiative (MGI), proposed in 2011, pushed computational material science into high gear19,20, and many material databases and platforms sprang up, such as the Materials Project (MP)16, the Open Quantum Materials Database (OQMD)21, the Novel Materials Discovery (NOMAD)22, and various proprietary databases compiled from the literature. Afterwards, six application-focused areas were identified as important (health and consumer materials, information technology materials, etc.), and a new route was planned for the development of new materials. Further investment in MGI principles can generate extraordinary advances that spark revolutionary new technologies and provide important opportunities for the next generation of advanced materials with transformative impact23.

The establishment and sharing of these databases offer an opportunity for the emergence of the “fourth paradigm of science” and the “fourth industrial revolution”, i.e., “data-driven material discovery”24, the critical idea of which is the combination of big data, artificial intelligence (AI), and material science25,26,27,28. The number of AI applications in material science is growing rapidly, with notable success in many systems, such as batteries29,30,31, solar cells32,33,34, and ecomaterials35,36. Just as quantum mechanical (QM) computing software was developed, it is necessary to develop infrastructure that combines material science and AI in order to enable both AI researchers and material experts to design materials using AI methods (machine learning (ML), etc.). Several pioneering efforts have been launched in recent years to achieve this goal37,38. Ward et al. developed a material data mining toolkit (Matminer), which offers one-stop access to multiple data sets and provides feature descriptors of components and structures for property prediction. This toolkit has become an important foundation for the joint use of AI and material data39. However, Matminer does not contain AI routines itself; instead, it processes data into formats that make various downstream AI libraries available for material science applications. The subsequent Automatminer pipeline can perform many AI steps (feature engineering, model selection, hyperparameter tuning, etc.), allowing the combined application of AutoML and Matminer to implement end-to-end material modeling pipelines40. In addition, the Materials Simulation Toolkit for Machine Learning (MAST-ML) was proposed to broaden and accelerate the use of ML in material science, and it lowered the barrier to entry for supervised learning (SL) modeling41.
The NOMAD AI toolkit, a web-browser-based platform for performing AI analysis of materials-science data, was presented by Sbailò et al. and is intended to bring the concept of reproducibility in material science to the next level42. However, using Matminer/Automatminer requires basic programming skills (e.g., in Python), which is unfriendly to material designers with little programming experience, and other ML modeling approaches (transfer learning (TL), unsupervised learning (UL), etc.) still need to be integrated into MAST-ML. A more extensive list of toolkits was provided by Morgan and Jacobs43. Generally speaking, existing material informatics tools can still be improved. It is necessary to establish a material informatics platform that supports all commonly used AI algorithms, requires little or no programming skill, and contains material databases. In addition, the lack of material property data and inaccurate material descriptors have become challenges for material modeling.

Here, we developed an AI platform, AlphaMat, that supports the whole life cycle of material modeling with over 90 functions (data collection → data preprocessing → feature engineering → model establishment → parameter optimization → model evaluation → result analysis). AlphaMat has broad applicability in material modeling, benefiting from component and structural descriptors. AlphaMat is the first material informatics platform that provides SL, TL, and UL simultaneously, and it can tackle material modeling tasks without limitations on data scale. In addition, AlphaMat has an interactive interface, runs locally, and requires no programming experience. As typical cases, we collected 12 material property databases from experiments and HTC calculations, including formation energy, metal/semiconductor, phonon property, dielectric constant, ionic conductivity, thermal conductivity, optical property, magnetism, ferroelectric property, band gap, bulk modulus, and adsorption energy (covering a total of 19,488 materials). AI models were then established, which can be used to enhance photoelectric conversion efficiency, improve the conductivity of metallic electrode materials, promote the cycle performance of batteries, discover new solid-state electrolytes, inhibit the shuttle effect of Li-S batteries, develop high-thermal-conductivity materials, solve the heat dissipation problem of electronic devices, etc. Compared with the time cost of the experiments or calculations used to construct the databases, AlphaMat saves significant time and hardware cost in material discovery. The practical application in energy science demonstrates AlphaMat’s ability to discover and design materials: it successfully identified 491 potential photovoltaic materials, 78 metallic electrode materials, 9 solid-state electrolytes, 58 thermal-conductivity materials, and 39 cathodes for Li-S batteries.
With AlphaMat, users can directly search the database according to their needs; AI models can also be easily built on data of any scale to discover and design materials. Following the principles of interaction, scalability, efficiency, and intelligence, AlphaMat, together with many other toolkits built by the larger material community, is expected to promote and accelerate the development of material science, computer science, and the physical and chemical sciences.

Results

Overview and architecture

Considering the current challenges and requirements of material modeling, AlphaMat was developed with nine core elements (Fig. 1): (1) Proprietary databases. AlphaMat aims to build databases of material properties from experiments, calculations, the literature, and open databases (e.g., databases of formation energy or band gap). (2) Data processing and analysis. The establishment of material data requires the unification of data formats, the conversion of file formats, and the statistical analysis of material properties. (3) Material descriptor design. AlphaMat can calculate suitable numerical vectors or matrices to represent materials, including component and structural descriptors. (4) Quantitative structure-property relationships (QSPR). Establishing material-property QSPR through AI models is the most important goal of AlphaMat. (5) New materials. Based on the well-trained QSPR, new materials with suitable properties can be explored and identified. (6) Novel properties. Properties of materials that have not been reported or studied before can be explored. (7) Physical interpretability. Uncovering feature importance from AI models to guide material design is both a challenge and a pursuit of material informatics. (8) End-to-end targeted design. This is closely related to physical interpretability and establishes a pattern of input-to-output automation that facilitates practical applications. (9) Advanced applications. The ultimate goal of AlphaMat is to promote the progress of various material systems (e.g., superconducting materials, battery materials, piezoelectric materials) by discovering high-performance materials for applications.

Fig. 1: Overview of AlphaMat.

Focusing on nine elements of material informatics (proprietary databases, data processing and analysis, material descriptor design, QSPR, new materials, novel properties, physical interpretability, end-to-end targeted design, and advanced applications), AlphaMat aims to accelerate the development of materials, shorten the development cycle of materials, and reduce the cost of experiments and traditional computations.

The organization of AlphaMat follows the data roadmap of material informatics, from data collection and data preprocessing to ML and application, as shown in Fig. 2. More details of the modeling process can be found in Supplementary Note 1. In AlphaMat (v0.0.7), over 90 functions have been designed, building on several existing tools (e.g., Matminer39, Python Materials Genomics (Pymatgen)44, Scikit-Learn45, extreme gradient boosting decision tree (XGBoost)46, and Mendeleev47). Researchers can use AlphaMat to complete the entire process of AI-based material modeling. An introduction to the material descriptors, AI models, and analysis tools is provided in Supplementary Notes 2–4.

Fig. 2: Architecture of AlphaMat.

a Input options, used to read and convert various material structure files. b Feature engineering, where component and structural descriptors are provided for materials representation. c Data-processing, which can preprocess the obtained features. d Machine learning, which covers almost all current AI modeling requirements. e Materials tool, which integrates a variety of convenient material data processing scripts to improve efficiency. f Materials database, also one of the main tasks of the software. It builds proprietary databases based on different material properties. g Output/result, which can be further analyzed with other various visualization tools (visual module is under development). h Application, the research areas/material systems to which the whole software can be applied.

Modeling cases

AlphaMat provides a complete workflow of data collection → data preprocessing → feature engineering → model establishment → parameter optimization → model evaluation → result analysis. AlphaMat is therefore well suited to calculating material descriptors, establishing QSPR, and screening and mining materials.

Here, as case studies, we used AlphaMat to predict 12 typical material properties (eight regression tasks and four classification tasks, see Table 1) with 19,488 data points in total, and highlighted the advantages of AlphaMat in these works. The twelve material properties are formation energy (Ef), band gap (Eg), the maximum frequency of an acoustic mode at Γ (breaking of the ASR, BASR), dielectric constant (εpoly), bulk modulus (K), ion migration activation energy (Ea), thermal conductivity (κ), second harmonic generation (SHG) responses, metals/semiconductors, ferroelectric/non-ferroelectric materials (Ferro/Non-ferro), strong/weak adsorption energy (ΔE), and ferromagnetic/antiferromagnetic materials (FM/AFM). It is worth noting that we chose the element-property component descriptor (a 120-element vector) as the material descriptor, which was defined by Meredig et al. and is integrated in AlphaMat (instruction 805)48. The XGBoost model, which is widely used in material science46,49,50, was applied for model training. The descriptions of XGBoost are given in 12104 (for classification tasks) and 12204 (for regression tasks) (see http://www.aimslab.cn). The data were split into a training set (80%) and a testing set (20%). As shown in Table 1 and Supplementary Figs. 1–12, the Pearson correlation coefficients (PC) of the eight regression models range from 0.675 to 0.933, with an average of 0.843, and the precision of the four classification models ranges from 0.82 to 0.93, with an average of 0.868. More details and applications are provided in Supplementary Notes 5–17. These typical case studies demonstrate the strong modeling ability of AlphaMat in material property prediction and material discovery.
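As an illustration of the evaluation protocol described above, the following minimal sketch shows an 80/20 split together with the Pearson correlation coefficient, the mean absolute error, and the precision. It is not AlphaMat code: the function names and the fixed random seed are assumptions made for this example, and the model training step (XGBoost in the case studies) is deliberately left out.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and return (train, test); 20% goes to the test set by default."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary choice for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between reference and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of samples predicted as positive that are truly positive."""
    truths_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not truths_of_predicted_pos:
        return 0.0
    hits = sum(1 for t in truths_of_predicted_pos if t == positive)
    return hits / len(truths_of_predicted_pos)
```

In a regression case study, `pearson` and `mean_absolute_error` would be applied to the held-out 20%; in a classification case study, `precision` would be applied instead.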

Table 1 Twelve case studies.

The 19,488 data points currently used for modeling are just the tip of the iceberg in the vast material space, and there are hundreds of millions of materials and properties to be explored. For example, in the MP database16, the Ef and K of 144,595 materials can be modeled by AlphaMat, saving significant computational time. In addition, the Eg prediction model established by AlphaMat can predict the Eg of ~68,000 materials (with unknown Eg in the MP database16) at the experimental level (each takes 64 h (ref. 51)), which will save considerable experimental time. Thus, based on existing experimental/computational data, modeling with AlphaMat can greatly shorten the experimental/computational cycle for new material discovery. We foresee that AlphaMat will become an important part of the existing material informatics software ecosystem, accelerating the deployment of material engineering.

Practical applications in high-performance materials

The twelve case studies demonstrate AlphaMat’s capabilities in material modeling. Here, using models of different material properties, we present several practical examples involving electrode materials, photoelectric materials, solid-state electrolyte materials, and thermal-conductivity materials.

Practical applications based on Eg

Eg is a key characteristic of electronic materials. For example, in perovskite solar cells, the hole transport layer (HTL) and the electron transport layer (ETL) should have an appropriate Eg (0.9–1.6 eV) to ensure the efficient transport of holes and electrons and to achieve optimal optical conversion efficiency52,53. Electrode materials generally have high electronic conductivity, i.e., Eg = 0 (refs. 54,55), while solid-state electrolytes require extremely low electronic conductivity, i.e., Eg > 3.5 eV51,56,57. Thus, accurately determining Eg is key to selecting functional materials and accelerating their development.

The MP database contains 144,595 data entries16. Among these, mono-element compounds have been studied thoroughly, while the laboratory synthesis of compounds with many elements is challenging. Therefore, binary (BC), ternary (TC), quaternary (QC), and pentabasic (five-element; PC) compounds were selected from MP to establish the initial data set. In addition, because thermodynamic stability is the most basic property of materials, we excluded materials with a convex hull energy (Ehull) greater than 0, leaving 32,858 materials in the end (5039 BC, 19,257 TC, 7287 QC, 1275 PC). Among the 12 case studies in Table 1, C1 can distinguish metals (Eg = 0) from semiconductors (Eg > 0), and R2 can predict the Eg of semiconductors. Using element property as the material descriptor (805 in AlphaMat), we applied the well-trained C1 and R2 models to search for new materials.
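The selection rule described above (two to five element types, and no material with Ehull greater than 0) can be sketched as follows. This is only an illustration under stated assumptions: the record layout and the field names `elements` and `e_above_hull` are hypothetical and do not reflect the actual MP schema or the AlphaMat implementation.

```python
from collections import Counter

# Labels used in the text for compounds with 2, 3, 4, and 5 element types.
COMPOUND_TYPES = {2: "BC", 3: "TC", 4: "QC", 5: "PC"}


def build_initial_set(entries):
    """Keep entries with 2-5 element types whose hull energy is not above 0.

    Each entry is assumed (hypothetically) to be a dict with keys
    "elements" (list of element symbols) and "e_above_hull" (eV/atom).
    """
    selected = []
    for entry in entries:
        n_elements = len(set(entry["elements"]))
        if n_elements in COMPOUND_TYPES and entry["e_above_hull"] <= 0:
            selected.append(entry)
    return selected


def count_by_type(entries):
    """Count selected entries per compound type (BC, TC, QC, PC)."""
    return Counter(COMPOUND_TYPES[len(set(e["elements"]))] for e in entries)
```

For example, a record such as `{"elements": ["Li", "Co", "O"], "e_above_hull": 0.0}` would be kept and counted as TC, whereas a mono-element record or one with a positive hull energy would be dropped.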

As shown in Fig. 3a, using the t-distributed stochastic neighbor embedding (t-SNE) method, the sites are colored by compound type, and compounds with different numbers of element types can be distinguished, as the sites of PC, QC, TC, and BC are stacked on top of each other. In the MP database, the Eg values of the 32,858 compounds were calculated with semi-empirical or low-precision functionals; they deviate greatly from the experimental values (generally by 1.0–2.0 eV) and are difficult to use directly in the actual screening of materials18,58. Using the band-gap models C1 and R2, we can rapidly predict (or update) the Eg of the 32,858 compounds. Our well-trained C1 has a prediction accuracy of 93% for distinguishing metals from semiconductors, the Pearson correlation coefficient between the Eg predicted by R2 and the experimental value is 0.933, and the MAE is only 0.347 eV (see Table 1). Therefore, the two models are of great significance for updating and reusing the materials in the MP database. As shown in Fig. 3b, the sites are colored by the Eg values predicted by C1 and R2, where the sites with large values (> 3.0 eV) are mainly concentrated on the right side of the t-SNE plot. This trend can be related to Fig. 3a: as the number of element types in a compound increases, the newly introduced elements are mainly non-metallic elements, such as O, S, F, Cl, and Br, which weakens the electronic conductivity of the material.

Fig. 3: Material discovery through C1, R2, and R7 models established by AlphaMat.

a t-SNE plot, where the sites are colored with their compound type. b t-SNE plot, where the sites are colored with their Eg. c Correlation of predicted Eg and calculated Eg, the color bar denotes the Eg changes. d t-SNE plot, where the sites are colored with their thermal conductivity, the oval mark determines the compounds with high thermal conductivities. e Discovered compounds with high thermal conductivities (red squares) and the known compounds (blue circles). f Embedded feature importance of C1. g Embedded feature importance of R2. h Embedded feature importance of R7.

Figure 3c shows the correlation between the MP-calculated Eg and the predicted Eg. The Eg changes of most materials are less than 2.0 eV (blue and purple dots), and the updated Eg is generally larger (green, yellow, and red dots), which is consistent with the conclusion that the Eg calculated with low-precision functionals is seriously underestimated18,59,60. We then identified 832 materials with Eg of 0.9–1.6 eV as photoelectric materials (HTLs, ETLs, photocatalysts, etc.) and 13 materials containing Li+ with Eg > 3.5 eV as solid-state electrolytes. In addition, electrode materials require excellent electronic conductivity (Eg = 0) as well as good mechanical properties. Referring to the shear modulus and bulk modulus of the commercialized materials LiNi0.3Mn0.3Co0.3O2 (NMC333), LiNi0.4Mn0.4Co0.2O2 (NMC442), LiNi0.5Mn0.3Co0.2O2 (NMC532), LiNi0.6Mn0.2Co0.2O2 (NMC622), and LiNi0.8Mn0.1Co0.1O2 (NMC811)61,62, we further selected 95 materials with shear modulus > 67 GPa and bulk modulus > 85 GPa as candidates for electrode materials. Moreover, for economic and environmental reasons, materials containing rare precious metal elements or radioactive elements were excluded, leaving 491 photoelectric, 9 solid-state electrolyte, and 78 electrode candidates, respectively (see Supplementary Tables 1–3).
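The screening criteria above can be written as a small rule-based filter. The sketch below is a hedged illustration rather than the procedure used in AlphaMat: the field names (`eg`, `elements`, `shear_modulus`, `bulk_modulus`) are hypothetical, and because the text does not list the excluded precious or radioactive elements, the exclusion set is left as a caller-supplied argument.

```python
def screen(material, excluded_elements=frozenset()):
    """Return the application labels that a material qualifies for.

    `material` is assumed (hypothetically) to be a dict with keys "eg" (eV),
    "elements" (list of symbols), and optionally "shear_modulus" and
    "bulk_modulus" (GPa). `excluded_elements` is a user-defined set.
    """
    if excluded_elements & set(material["elements"]):
        return []  # contains an element the user chose to exclude

    labels = []
    eg = material["eg"]
    if 0.9 <= eg <= 1.6:
        labels.append("photoelectric")
    if eg > 3.5 and "Li" in material["elements"]:
        labels.append("solid_state_electrolyte")
    if (
        eg == 0
        and material.get("shear_modulus", 0.0) > 67
        and material.get("bulk_modulus", 0.0) > 85
    ):
        labels.append("electrode")
    return labels
```

Called as `screen({"eg": 1.2, "elements": ["Cu", "O"]})`, the function returns `["photoelectric"]`; passing a non-empty `excluded_elements` set reproduces the final exclusion step.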

Practical applications based on κ

κ is an important thermal property of electronic materials and devices. Materials with high κ (e.g., C, 2235 W m−1 K−1; BN, 1600 W m−1 K−1) can be used to solve the heat dissipation problem of electronic products, and the development of new thermal-conductivity materials will support future space and ocean exploration63,64. Among the 12 case studies in Table 1, R7 can predict the κ of given materials. Using element property as the material descriptor, we applied the well-trained R7 model to search for materials. As shown in Fig. 3d, the t-SNE plot shows that most materials have very small κ (< 100 W m−1 K−1), and materials with κ > 100 W m−1 K−1 are highly concentrated (see the oval mark). Figure 3e shows the discovered materials (red squares) with high κ, such as B6O (408.7 W m−1 K−1), B13C2 (407.7 W m−1 K−1), B6P (355.0 W m−1 K−1), and BeCN2 (296.0 W m−1 K−1). These new thermal-conductivity materials are comparable to the well-known GaN (210.0 W m−1 K−1) and have broad prospects in optoelectronics, high-temperature high-power devices, and high-frequency microwave devices (see Supplementary Table 4).

The predicted Eg and κ of 32,858 materials at the experimental level may be of wide interest to the experimental community in multiple areas of research (batteries, catalysis, electronics, etc.). In addition to establishing high-precision QSPR, AlphaMat also provides model interpretability, which is a unique feature. Figure 3f–h show the embedded feature importance of C1, R2, and R7, respectively. For C1, the mean number of valence electrons of p orbitals (MNVEp, 13%) in compounds and the mean of periodic table rows (MPTR, 2.7%) play a key role in distinguishing metals from semiconductors. This has guiding significance for the design of corresponding materials. The fractions of B (fracB, 2.7%) and Ta (fracTa, 2.5%) are also important because, in the training data set, the compounds containing B are mainly semiconductors, while those containing Ta are metals. For R2, the mean electronegativity (ME, 14.3%), the fraction of valence electrons of p orbitals (FVEp, 10.7%), the mean of periodic table columns (MPTC, 6.7%), and the fraction of F (fracF, 3.9%) are relatively important for predicting Eg values. For R7, the fraction of valence electrons of s orbitals (FVEs, 31.1%) and MNVEp (16.2%) are particularly important for thermal conductivity prediction, which is consistent with the phenomenon that heat conduction is mainly the diffusion of free electrons from the high-temperature end to the low-temperature end, resulting in heat flow. These key features are of great significance for the further directed design of functional materials65.
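To make features such as the mean electronegativity (ME), the mean of periodic table rows (MPTR), and element fractions (e.g., fracB) more concrete, the sketch below computes them for a composition. It is a simplified illustration, not the 120-element descriptor of Meredig et al. or the AlphaMat routine: the use of composition-weighted averages is an assumption, the function and key names are hypothetical, and the small lookup table covers only four elements (Pauling electronegativities and period numbers).

```python
# Illustrative lookup table: symbol -> (Pauling electronegativity, periodic-table row).
ELEMENT_DATA = {
    "B": (2.04, 2),
    "N": (3.04, 2),
    "O": (3.44, 2),
    "Ga": (1.81, 4),
}


def composition_features(composition):
    """Compute simple element-property features for a composition.

    `composition` maps element symbols to stoichiometric amounts,
    e.g. {"B": 6, "O": 1} for B6O. Averages are weighted by atomic fraction
    (an assumption made for this sketch).
    """
    total = sum(composition.values())
    fractions = {el: amount / total for el, amount in composition.items()}

    features = {
        "mean_electronegativity": sum(
            frac * ELEMENT_DATA[el][0] for el, frac in fractions.items()
        ),
        "mean_row": sum(frac * ELEMENT_DATA[el][1] for el, frac in fractions.items()),
    }
    # One fraction feature per element in the lookup table (0.0 if absent).
    for el in ELEMENT_DATA:
        features["frac" + el] = fractions.get(el, 0.0)
    return features
```

For B6O, this gives a B fraction of 6/7 and a mean row of 2; for GaN, the mean row is 3. A full descriptor would extend the same idea to many more elements and properties.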

Practical applications based on ΔE

UL methods are based on unlabeled data and can therefore overcome the obstacle of scarce material property labels. However, a UL module is still missing from many existing material informatics platforms. In the above case studies, the data set of ΔE between AB2-type 2D materials and Li2S6 is small (only 65 entries)66, which is not conducive to establishing QSPR. The search for materials with strong adsorption (|ΔE| > 1.0 eV) of Li2S6 helps to discover new cathode materials for lithium-sulfur (Li-S) batteries and to inhibit the “shuttle effect”. Here, we demonstrate a UL method for discovering new cathodes for Li-S batteries. In total, 826 stable AB2-type compounds were selected from the 2DMatPedia database, of which 65 materials have known adsorption energies with Li2S6, and the remaining 761 are unknown67. Figure 4a shows the bottom-up tree diagram (dendrogram) obtained with the agglomerative hierarchical clustering (AHC) algorithm in AlphaMat, where a suitable partition line was selected and the 826 AB2-type compounds were classified into seven groups (see Supplementary Fig. 13, from G1, G2, …, to G7). We mapped the 65 known ΔE values onto the dendrogram, and compounds marked in green, orange, and red are promising according to the ideal threshold (−1.0 eV)66. The clustering of AB2-type compounds provides physical insight into which compounds exhibit suitable adsorption energies for Li2S6. Figure 4b gives the statistics of known and unknown compounds in each group: G4 has the most compounds (319), while G3 has the fewest (29), indicating that a targeted study of these groups would significantly narrow down the initial scope (761 unknown compounds). Figure 4c shows the ratio of known compounds (black line) and the ratio of desired compounds (blue line) in each group. Notably, in G1, G3, and G5, the ratio of desired compounds to known compounds is 100%, which is much higher than that in the other groups. This phenomenon can also be observed in Fig. 4a.
These results suggest that the unknown compounds in G1, G3, and G5 are worthy of further investigation (142 compounds in total) and that they may also be potential cathode materials for Li-S batteries. The violin plots of the known ΔE shown in Fig. 4d further reveal that G5 is of high research value because of its higher average absolute adsorption energy (1.62 eV). As a result, the scope of exploration narrowed from 761 compounds to 84 compounds in G5. Moreover, compounds containing rare precious metal elements or radioactive elements were excluded, finally leaving 39 compounds, as provided in Supplementary Figs. 14, 15. More details about the position of the partition line are discussed in Supplementary Note 18.
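The group-level reasoning behind Fig. 4b, c can be sketched as follows, assuming that the cluster labels have already been obtained (for example, from an AHC run). The function names and the dictionary-based data layout are hypothetical and are used only for illustration; the criterion |ΔE| > 1.0 eV follows the text.

```python
from collections import defaultdict


def group_statistics(labels, known_delta_e, min_abs_delta_e=1.0):
    """Count total, known, and desired compounds per cluster.

    labels:         dict mapping compound name -> group label (e.g. "G5")
    known_delta_e:  dict mapping compound name -> known adsorption energy (eV)
    A known compound is "desired" if |ΔE| exceeds `min_abs_delta_e`.
    """
    stats = defaultdict(lambda: {"total": 0, "known": 0, "desired": 0})
    for compound, group in labels.items():
        entry = stats[group]
        entry["total"] += 1
        if compound in known_delta_e:
            entry["known"] += 1
            if abs(known_delta_e[compound]) > min_abs_delta_e:
                entry["desired"] += 1
    return dict(stats)


def promising_groups(stats):
    """Groups in which every compound with a known ΔE is a desired one."""
    return sorted(
        group
        for group, s in stats.items()
        if s["known"] > 0 and s["desired"] == s["known"]
    )
```

The unknown compounds in the groups returned by `promising_groups` would then be the candidates for further study, mirroring the way G1, G3, and G5 are singled out above.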

Fig. 4: Unsupervised discovery of AB2-type compounds as cathodes for Li-S batteries.

a Dendrogram generated by the AHC method in AlphaMat. The dashed line shows the position where all compounds are partitioned into seven groups, marked as G1–G7 from left to right and distinguished by different colors. b Statistics of known compounds in each group. Gray bars represent the number of compounds in each group after clustering, and red represents the number of known compounds in each group. c Ratio of known compounds (black line) and the ratio of desired compounds (blue line) of each group. d Violin plots of ΔE in seven groups. The outer shells of the violins bound all data, narrow horizontal lines bound 90% of the data, thick horizontal lines bound 50% of the data, and white dots represent means.

Discussion

The challenges of material informatics prompted us to develop an advanced computational infrastructure. In this work, we presented an AI platform that supports the whole life cycle of material modeling, from data analysis, feature engineering, and model establishment and optimization to model evaluation and result analysis. The proposed AlphaMat integrates SL, TL, and UL simultaneously, which allows it to tackle tasks in material science more comprehensively. Furthermore, AlphaMat establishes proprietary databases with more than 117,000 material-property entries (see http://www.aimslab.cn). Since AlphaMat runs locally, the training of its AI models is not limited by the scale of the data sets (from 10^1 to 10^6). Consequently, AlphaMat will accelerate the innovative discovery of new materials, new functions, and new principles, compared with trial-and-error experiments and high-throughput calculation methods. The 12 case studies of material modeling (formation energy, band gap, magnetism, adsorption energy, thermal conductivity, ionic conductivity, etc.) demonstrate the effectiveness and usefulness of AlphaMat, and the practical application in searching for high-performance materials demonstrates AlphaMat’s ability to mine and design materials: it successfully identified new materials for use in various systems (photonics, batteries, catalysis, capacitors, etc.) from large inorganic compound databases. Using AlphaMat, users can either directly retrieve data from our database or easily build AI models to discover and design materials.

It should be mentioned that ML is only as good as the data it is trained on, and predictions for data outside the distribution of the training set are likely to fail dramatically. Therefore, the prediction results of an ML model will be uncertain to a certain extent. For more complex problems, traditional computational or experimental methods are also needed for further verification. At the very least, however, ML offers specific candidates to speed up material development. Going forward, we will continue to improve and release AlphaMat to address the challenges commonly encountered in material modeling: (1) continuously expand the databases across material systems (fullerenes, nanocomposites, metamaterials, etc.) and properties (superconductivity, optical coefficient, etc.), using AI methods (e.g., natural language processing, generative models), to alleviate the challenge of data scarcity and make the databases available to more scientists in different material subfields; (2) integrate more popular component and structural descriptors and develop new descriptors to represent materials, improve model accuracy, and make models interpretable; (3) incorporate frontier AI algorithms in a timely manner to cope with more material modeling tasks; (4) add more convenient tools and a visualization interface to improve the efficiency of processing material data. We hope that the continually released AlphaMat will deeply unite material science and AI approaches and become an essential tool in scientific research.

Methods

Architecture

Various material data can be generated or collected from simulations, experiments, the literature (manually collected from published papers), and open databases. For software-generated material structure files containing atomic information, batch conversion between file formats needs to be completed first, as shown in Fig. 2a. The material descriptors are then constructed based on the component and structural information (Fig. 2b). Data in plain text format can be passed directly to the data preprocessing module (Fig. 2c). Four main learning tasks (classification, regression, clustering, and dimensionality reduction), three types of models (supervised learning, transfer learning, and unsupervised learning), and different AI models have been designed and integrated in AlphaMat (Fig. 2d). Furthermore, considering the importance of hyper-parameters in AI models, AlphaMat also provides two commonly used optimization methods to search for the optimal hyper-parameters. In addition to easing AI modeling, we integrated various portable material tools in AlphaMat (Fig. 2e). Moreover, AlphaMat aims to include all kinds of material databases and categorize them according to material properties (Fig. 2f). Finally, data, features, and models can be automatically saved in the current directory for further visual analysis (Fig. 2g). The whole development and use process of AlphaMat closely combines the components, structures, and properties of materials with AI (data, features, and models), and is expected to be widely used in various material systems (superconducting materials, battery materials, alloy materials, etc.; Fig. 2h).

The core elements and architecture can be found in Fig. 1 and Fig. 2. Python was used as the AlphaMat primary back-end programming language to complete each function. More implementation details are provided in Supplementary Information.