Natural Language Processing Techniques for Advancing Materials Discovery: A Short Review

  • Review
  • Published in: International Journal of Precision Engineering and Manufacturing-Green Technology

Abstract

The development of new industries is driving a growing demand for innovative materials, yet discovering such materials is laborious and time-consuming. In response, researchers are increasingly drawing on existing materials science knowledge to study new materials more efficiently. The number of materials science papers has grown over the past two decades, and as the methods for exploiting them have been systematized, attempts to use this literature for research have grown as well. Past research papers can be used, for instance, to predict new materials or to obtain optimal synthesis parameters for materials with desired properties. Natural language processing (NLP) is a crucial technology in this effort. Over the past decade, NLP has emerged as one of the most rapidly expanding areas of artificial intelligence and has proven to be a valuable tool for processing language-based data. In this review, we examine how NLP is applied to the materials science literature, which processes it can support, and the primary NLP technologies currently in use, with a particular focus on specific use cases. We also discuss the limitations of this approach.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1F1A1074339, No. 2022R1C1C1009387).

Author information

Corresponding author

Correspondence to Kyoungmin Min.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper is an invited paper (Invited Review).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lee, J.H., Lee, M. & Min, K. Natural Language Processing Techniques for Advancing Materials Discovery: A Short Review. Int. J. of Precis. Eng. and Manuf.-Green Tech. 10, 1337–1349 (2023). https://doi.org/10.1007/s40684-023-00523-6

  • DOI: https://doi.org/10.1007/s40684-023-00523-6
