Abstract
Developing new industries creates a growing demand for innovative materials, yet finding such materials is laborious and time-consuming. In response, researchers have shifted toward discovering new materials more efficiently by drawing on existing materials science knowledge. The number of materials science papers has grown over the past two decades, and attempts to mine them for research purposes have increased as the methods have become systematized. Past research papers can be used, for instance, to predict new materials or to obtain optimal synthesis parameters for materials with desired properties. Natural language processing (NLP) is a crucial technology in this movement. Over the past decade, NLP has emerged as one of the most rapidly expanding areas of artificial intelligence and has proved to be a valuable tool for processing language-based data. In this review, we examine how NLP is applied to the materials science literature, which processes it can support, and the primary NLP technologies currently in use, with a particular focus on specific use cases. We also discuss the limitations of this approach.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1F1A1074339, No. 2022R1C1C1009387).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
This paper is an invited paper (Invited Review).
About this article
Cite this article
Lee, J.H., Lee, M. & Min, K. Natural Language Processing Techniques for Advancing Materials Discovery: A Short Review. Int. J. of Precis. Eng. and Manuf.-Green Tech. 10, 1337–1349 (2023). https://doi.org/10.1007/s40684-023-00523-6
DOI: https://doi.org/10.1007/s40684-023-00523-6