Importance of materials informatics education

Materials discovery has enormously benefited from the application of state-of-the-art machine learning (ML) and artificial intelligence (AI) methods to materials data sets, both historical and newly generated. Combining large quantities of materials data with AI/ML algorithms to develop predictive models and optimization schemes, perform high-throughput screening, and automate or accelerate experiments and computations in general, has come to be known as materials informatics. Many successful discoveries of new battery electrolytes, solar-cell absorbers, catalysts, thermoelectrics, dielectrics, and high-entropy alloys, among myriad other materials applications, are thanks to informatics-based approaches.1 Informatics has become an indispensable tool to both computational and experimental materials scientists looking to make the next big materials discovery. The field has matured to such an extent that well-established researchers feel the need to embrace these methods even late in their careers, while beginners have been tasked with learning AI/ML concepts during their materials science education.

Although some roadblocks remain to fully incorporating materials informatics into the mainstream curriculum, most materials scientists recognize the importance of teaching students best data practices, ways to extract qualitative and quantitative relationships from data, and fluency in programming languages such as Python. The barrier to applying AI/ML toward materials research is lower than ever before, given the sheer number of tools and resources available online. For instance, the Scikit-learn Python library implements most of the widely used ML regression, classification, clustering, and dimensionality reduction algorithms, any of which can be easily applied to suitably structured input–output data.2 With just a few lines of code, researchers can call the necessary functions and train supervised ML models based on random forests, ridge regression, neural networks, and other regression algorithms, and can readily apply sophisticated data visualization and processing techniques. Another source of ready access to materials informatics tools is repositories hosted on GitHub and on similar services such as nanoHUB,3 which typically contain reactive code notebooks such as Jupyter or Pluto notebooks,4,5 Python scripts, and executables that can be adapted and applied to new materials data sets.
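To make the "few lines of code" claim concrete, the following is a minimal sketch of training two Scikit-learn regressors on input–output data. The data here are synthetic stand-ins for a descriptor matrix and a property vector (an assumption for illustration, not data from the workshop), and the model settings are arbitrary defaults rather than recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 200 "materials", 5 descriptors, 1 property.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + 0.05 * rng.normal(size=200)

# Hold out 20% of the data as an unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)      # train on the training set only
    pred = model.predict(X_test)     # predict for unseen datapoints
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))
    print(f"{name}: test RMSE = {rmse[name]:.3f}")
```

Swapping in another regressor only requires adding an entry to the `models` dictionary, because Scikit-learn estimators share the same `fit`/`predict` interface.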

Figure 1

Snapshots from the workshop held at the MRS Fall Meeting in November 2022, at peak audience in the morning (top) and at the end of the day (bottom).

Materials informatics workshops

In their recent article, Sun et al.6 discuss the lessons learned from conducting an ML workshop for a materials science audience. As the successor to the workshop discussed in their article, we organized and successfully conducted an in-person hands-on tutorial titled “DS00: Machine Learning in Materials Sciences – From Basic Concepts to Active Learning,” at the Materials Research Society (MRS) Spring Meeting in Honolulu in May 2022, as well as at the MRS Fall Meeting in Boston in November 2022. The idea of making such a tutorial a necessary and recurring feature of all MRS meetings emerged from extensive discussions between members of the MRS AI Staging Committee, which consists of several leading materials informatics researchers and is tasked with “establishing and promoting AI as a crosscutting research topic within MRS.” The tutorial provided a basic introduction to concepts of ML, covered some examples of how it is applied for materials design, and went through several detailed case studies using reactive code notebooks, which were accessible to all audience members, who were encouraged to execute code snippets in parallel with the instructors. Despite challenges posed by the COVID-19 global pandemic and conferences still slowly building up to pre-2020 attendance, the tutorial was positively received, with between 50 and 100 people in attendance at its peak at both meetings. This experience has helped establish a baseline for how to conduct such introductory workshops and how best to interactively teach essential concepts of data visualization, feature generation, model training and testing, and new predictions for discovery. Figure 1 shows pictures from the workshop conducted at the 2022 MRS Fall Meeting.

Given the fact that tutorial days at MRS meetings are generally “day 0” before the official technical symposium sessions begin, and the fact that there are many other tutorials going on at the same time, the audience for tutorial DS00 constituted a satisfactory number. What was also striking was the breakdown of audience members: there were arguably as many experienced researchers as there were beginner-level students, with an equal degree of enthusiasm among all, as gauged by their active participation in opening and running through the reactive code notebooks, asking questions, and engaging in meaningful discussion during the dedicated Q&A periods, and even taking part in the ML challenge at the end of the day to apply what they learned during the tutorial and give themselves a shot at finishing at the top of a results leaderboard (discussed in the “Raising curiosity and confidence via competitions” section). Some of the audience members were indeed researchers already applying informatics-based approaches in their work, looking to consolidate their knowledge and interact with like-minded people. While the daylong tutorial is meant to be no more than a crash course in applying AI/ML toward materials science problems, it provides a vital boost to both beginners and active researchers on (a) the importance and inevitability of materials informatics, and whether/how the underlying approaches apply to their work; (b) how to get started and become comfortable with specific algorithms, keeping in mind best data and software practices; and (c) how to incorporate materials informatics education within their own research groups and institutions. 
In the “Nuts and bolts of ML,” “Teaching via reactive code notebooks,” “Demonstrating essential concepts using an example: ML-driven perovskite discovery,” “Neural networks,” “Gaussian processes and active learning,” and “Raising curiosity and confidence via competitions” sections, we discuss some details of our ML workshops, which followed the general technical outline presented in Figure 2. Topics of discussion include the nuts and bolts of ML that all researchers need to be familiar with, the advantages of using reactive code notebooks, specific approaches such as neural networks and active learning, and finally, the importance of raising curiosity and confidence through competitions.

Figure 2

An outline of our materials informatics workshop. DFT, density functional theory.

Nuts and bolts of ML

The simplest version of applying ML to materials is when an input–output materials data set (inputs being easily attainable descriptors and outputs being properties of interest) is compiled and a supervised or unsupervised learning algorithm is applied to obtain a predictive model that can be deployed on many unseen datapoints. To this end, there are a number of essential concepts, referred to here as the “nuts and bolts of ML,” which need to be understood and which were the subject of discussions in the early part of the tutorial. These nuts and bolts include:

  • Materials data: Perhaps the most important component of a materials informatics framework is the data set. This could be computational or experimental data, it may range from a few dozen to hundreds or even millions of datapoints depending on the problem, and it could be obtained from databases such as the Materials Project7 or be freshly generated. The ability to procure reasonably accurate and sufficiently large quantities of data is key to ML predictive power and utility.

  • Materials descriptors: An enormous body of work exists on critical examination of various types of features, fingerprints, or descriptors that can be used to effectively represent materials in a data set structured for ML.8,9 Such descriptors must be easy to obtain (or learn), uniquely defined for each material, and generalizable across the entire chemical space. Common examples include using one-hot encoded composition vectors, tabulated elemental properties such as ionic radius and electronegativity, Coulomb matrices, atomic overlap functions, and cheaper computational information. Descriptors can also be learned using ML methods, especially for high-dimensional data sets such as images.

  • Data pre-processing: Once the input (descriptors)–output (properties) data are compiled, some operations are necessary before ML tasks can be performed. These include straightforward visualization of inputs versus outputs, standardization and normalization of the inputs, and dimensionality reduction for lower-dimensional projection and visualization of the data using methods such as principal component analysis (PCA) and t-SNE.10 These exercises help us gain an intuition for our data sets, setting prior expectations of the ML models subsequently developed.

  • Supervised versus unsupervised learning: Depending on the problem at hand, either type of approach could be applicable. Whereas supervised learning techniques find unknown functions connecting inputs to outputs from labeled data based on interpolation and extrapolation of patterns, unsupervised learning aims to find patterns in unlabeled data (i.e., “outputs” do not exist) to yield a clustering of samples.11 A typical workflow of a supervised learning algorithm is presented in Figure 3, where many of the highlighted components are the same as the ML nuts and bolts being discussed here.

  • ML algorithm: The choice of a specific ML regression, classification, or optimization algorithm, although less significant than the quality of the data and descriptors, is still quite important. Typical studies involve testing multiple algorithms before the best one is selected, or using an algorithm that worked well on similar data sets in the past. Neural networks work best with large data sets where physical understanding of contributions from specific features may not be key, and are frequently applied for predicting atomic forces and formation energies of solids and molecules. Methods such as random forest yield feature importance values, while Gaussian process regression provides uncertainty estimates along with predictions.12 A whole host of other methods have been successfully applied to materials data sets and are readily available via Scikit-learn and other packages.

Figure 3

A typical supervised learning workflow, showing the various nuts and bolts of machine learning discussed in this article. Figure adapted with permission from Springer Nature.1

  • Training, validation, and testing: In practice, the data set is split into training and testing sets, with the training set used exclusively for optimizing the ML model in terms of parameters and accuracy of prediction, whereas the testing set serves as a true test of the model on unseen datapoints. The training set itself is repeatedly split into training and validation subsets in a process called cross-validation, wherein model parameters are chosen so as to minimize the prediction error on held-out subsets of the training set over multiple cycles of splitting the data, training, and then validating. Furthermore, one can visualize the effect of training data size using learning curves, which show the test prediction error as a function of the number of datapoints used to train and optimize the model, in an attempt to determine the saturation point. Such curves can also be plotted for the error as a function of the number (or combinations) of input features or of algorithm parameters.

  • Hyperparameter optimization: Every ML algorithm has a set of “hyperparameters” that need to be optimized for best predictive performance. For instance, hyperparameters in random forest regression include the number of trees in the forest, maximum depth of each tree, number of features to consider when looking for the best node split, and the minimum number of samples required to be at a leaf node. Typically, finding the optimal set of all parameters is done by employing a grid-search approach where the algorithm cycles through many combinations, but there are more efficient ways to do this, such as via Bayesian optimization using a function set to minimize the validation error.13

  • New discovery and understanding: The ultimate objective of applying ML to materials data is to attain quantitative or qualitative models that can reveal atom/composition/structure–property relationships that were previously unknown, and enable predictions over a combinatorial space, which can then be used for screening and discovery.12,13,14 Once rigorously optimized, ML models need to be deployed for new discovery and understanding, which may involve enumerating new datapoints and making predictions, performing inverse design using approaches such as genetic algorithm or generative neural networks15 to obtain new materials with desired properties, and generally making best models available to be coupled with ongoing computations or experiments, often within an active learning framework16 for continuous improvement and discovery.
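Several of the nuts and bolts listed above (standardization, train/test splitting, cross-validation, grid-search hyperparameter optimization, and learning curves) can be tied together in a short Scikit-learn sketch. The data are synthetic and the hyperparameter grid is deliberately tiny; both are assumptions for illustration. Note that the learning curve here reports the cross-validated error on the training set rather than the test error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, learning_curve, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic descriptors and property (assumption, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

# The test set is set aside and never used during optimization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Standardize inputs, then fit a random forest.
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=1))

# Small illustrative grid over the hyperparameters named in the text.
param_grid = {
    "randomforestregressor__n_estimators": [50, 100],
    "randomforestregressor__max_depth": [4, None],
    "randomforestregressor__min_samples_leaf": [1, 3],
}

# Fivefold cross-validation on the training set selects the hyperparameters.
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)
cv_rmse = -search.best_score_
test_rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))
print("best hyperparameters:", search.best_params_)
print(f"CV RMSE = {cv_rmse:.3f}, test RMSE = {test_rmse:.3f}")

# Learning curve: validation RMSE versus number of training datapoints.
sizes, _, val_scores = learning_curve(
    search.best_estimator_, X_train, y_train,
    train_sizes=[0.2, 0.5, 1.0], cv=5,
    scoring="neg_root_mean_squared_error",
)
val_rmse = -val_scores.mean(axis=1)
for n, err in zip(sizes, val_rmse):
    print(f"{n:4d} training points -> validation RMSE {err:.3f}")
```

A grid search scales poorly as the number of hyperparameters grows, which is one motivation for the Bayesian optimization alternative mentioned above.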

Teaching via reactive code notebooks

Reactive code notebooks, such as Jupyter and Pluto,4,5 represent some of the most frequently used web-based computing environments, which allow the user to author the notebook, write and run their code interactively, and add any number of details in the form of markdown cells. Such notebooks are often self-sufficient and independent of related publications, data sets, or codes. They allow the inclusion of images, equations, narrative texts, plots, and videos, each designed and inserted at specific locations throughout the notebook to explain different components of the overall code and project. The reactive code notebook allows users to edit and run code from right within their browsers, and perform live visualization of results as tabulated values, plots, or in other formats. Notebooks contain kernels that allow executing code in different programming languages such as MATLAB and Python, as well as different types of cells that include live code, headings, markdown, and raw information.

It is no wonder that one (or multiple) all-encompassing reactive code notebook(s) is the preferred medium for instructors conducting a hands-on AI/ML workshop. Specific benefits include the following: (a) all code and underlying information can be shared with the audience via public repositories such as on GitHub, with a single URL accessible to anybody; (b) the notebook (and associated data files) can either be downloaded to one’s personal machine and run locally (assuming all code dependencies are installed and up to date), or be opened using an application such as Colaboratory17 to be run on Google’s processors, ensuring that attendees can run the code (alongside the instructor) on their own laptops, making local edits or updates as and when needed; (c) attendees can easily copy particular code cells to other notebooks and adapt them toward their research projects; and (d) community-informed updates can periodically be made to the notebook, such that it is improved and enhanced every time it is used for a tutorial. It should be noted that today, in materials informatics circles, it has become common practice to release such notebooks as supporting documents alongside journal publications or conference presentations, which goes a long way toward ensuring that materials data remain FAIR (findable, accessible, interoperable, and reusable).18 The fact that the notebooks can also be used to educate students and other researchers on the important aspects of the research project is an added bonus.

Demonstrating essential concepts using an example: ML-driven perovskite discovery

An “ML for materials science” workshop can take many different forms depending on the subject matter it aims to cover. It could, for instance, be focused on something very specific such as understanding and applying graph-based neural networks for materials discovery,19 accelerating molecular dynamics via ML-predicted atomic forces and energies,20 or performing automated and efficient experimental discovery using active learning.21 In an introductory/overview type of workshop, our focus is on using an intuitive, easy-to-understand, and preferably published example that involves the use of the specific concepts covered in the first part of the tutorial: collecting materials data, obtaining or learning descriptors, processing, optimizing, and visualizing results from various learning approaches, and deploying ML models for new discovery. As such, we opted for a project involving the discovery of novel halide perovskite alloys for optoelectronic applications, using ML models trained upon a substantial computational data set of relevant properties.14,22

A Jupyter notebook laying out this entire project can be found on GitHub23 as well as on nanoHUB;24 this includes all code necessary for reading and visualizing the data (properties + different types of descriptors), performing clustering and examining correlations, training different types of regression models, and coupling the best predictive models with a genetic algorithm to discover new perovskite compositions that show a targeted set of properties. In addition, several boxes of descriptive text and images detail all the technical content, motivation, methodology, and primary results/observations of the project. This work involved performing high-throughput density functional theory (DFT) computations on a data set of \(\approx\)500 compounds, yielding their energies of decomposition (a measure of stability), electronic bandgaps, and theoretical photovoltaic efficiencies. To determine suitable semiconductors for use as absorbers in solar cells or in related optoelectronic applications such as LEDs, lasers, photodiodes, and sensors, one must screen for materials that are stable, have a suitable bandgap within the visible spectrum, and have a high absorption efficiency. Given the computational expense of accurate DFT (hours to days of HPC time needed for a single compound) and the highly combinatorial nature of the ABX\(_3\) perovskite chemical space (where A and B are cations while X is a halogen anion), which arises from the number of options for A/B/X species and the prospects of alloying at each site, DFT cannot endlessly be performed for millions of candidates, and needs to be coupled with ML models that can make predictions in mere seconds. 
Thus, using one-hot encoded vectors of the ABX\(_3\) composition as well as elemental properties such as electron affinities and ionic radii of the A/B/X species, we trained predictive models using various linear and nonlinear regression algorithms, leading to highly accurate predictions of the three properties of interest for any number of new perovskite compositions.
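As a rough illustration of how such descriptors might be assembled, the sketch below builds a composition vector (site fractions, which reduce to a one-hot encoding for unalloyed compounds) and appends fraction-weighted elemental properties for each site. The species lists, the `featurize` helper, and the ionic radius values are assumptions chosen for illustration; the radii are rough placeholder numbers and should be replaced with properly tabulated data.

```python
import numpy as np

# Hypothetical choice of species at each site of ABX3 (assumption).
SPECIES = {"A": ["Cs", "MA", "FA"], "B": ["Pb", "Sn"], "X": ["I", "Br", "Cl"]}

# Rough illustrative ionic radii in angstrom (placeholders, not reference data).
RADIUS = {"Cs": 1.88, "MA": 2.17, "FA": 2.53, "Pb": 1.19, "Sn": 1.10,
          "I": 2.20, "Br": 1.96, "Cl": 1.81}


def featurize(comp):
    """Return [site fractions..., weighted radius per site] for one compound.

    `comp` maps each site to a dict of species fractions, for example
    {"A": {"Cs": 1.0}, "B": {"Pb": 1.0}, "X": {"I": 1.0}}.
    """
    vec = []
    # Composition part: one entry per allowed species, in a fixed order.
    for site, species in SPECIES.items():
        fractions = [comp[site].get(s, 0.0) for s in species]
        if not np.isclose(sum(fractions), 1.0):
            raise ValueError(f"fractions at site {site} must sum to 1")
        vec.extend(fractions)
    # Elemental-property part: fraction-weighted average radius per site.
    for site, species in SPECIES.items():
        vec.append(sum(comp[site].get(s, 0.0) * RADIUS[s] for s in species))
    return np.array(vec)


pure = featurize({"A": {"Cs": 1.0}, "B": {"Pb": 1.0}, "X": {"I": 1.0}})
alloy = featurize({"A": {"Cs": 0.5, "FA": 0.5}, "B": {"Pb": 1.0},
                   "X": {"I": 0.75, "Br": 0.25}})
print(pure)
print(alloy)
```

Other elemental properties, such as electron affinity, could be appended in the same way by adding further lookup tables.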

Figure 4

Outline of the reactive code notebook on the perovskite design work, showing some code snippets and density functional theory (DFT)/machine learning (ML)/genetic algorithm (GA) plots. PCA, principal component analysis; RFR, random forest regression; KRR, kernel ridge regression; GPR, Gaussian process regression; RMSE, root-mean-square error; HPO, hyperparameter optimization; CV, cross-validation; PV, photovoltaic.

A detailed outline of the perovskite DFT-ML notebook is provided in Figure 4 as a flowchart, along with accompanying screenshots of code and plots showing some of the results. The workshop provided the attendees with a brief background of this project, introducing the data set and ML approaches involved, before delving into details with the help of the notebook. The notebook begins with a description of the project, references to relevant publications, and acknowledgment of the authors who contributed to it, followed by an introductory code cell that imports all necessary Python packages. The data are next read from CSV (comma separated value) files and formatted as a pandas dataframe25 that includes necessary compound labels, computed properties, and all descriptor dimensions. One of the concepts explored in this work is “multi-fidelity learning,”26 which involves generating data from two different levels of theory within DFT, namely the semilocal GGA-PBE functional (or PBE),27 which is cheaper but has low accuracy compared to experiment, and the hybrid HSE06 functional (or HSE),28 which is more expensive but also more accurate; while the PBE data set contains nearly 500 points, the HSE data set contains half of that. The next code cells present a visualization of the two data sets in terms of one property plotted against another, a PCA-based 2D projection, and Pearson coefficients of linear correlation calculated between descriptor dimensions and properties, thus providing qualitative insights into the effect of different A/B/X species and their specific elemental properties on increasing or decreasing the perovskite stability and optoelectronic properties.
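The correlation and projection steps described above can be sketched with pandas and Scikit-learn as follows. The dataframe is generated on the fly instead of being read from the project's CSV files, and the column names and the linear relationship between them are hypothetical, so the numbers carry no physical meaning.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical descriptor and property columns (assumption); in practice the
# dataframe would come from something like pd.read_csv on the project data.
rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({
    "radius_A": rng.uniform(1.5, 2.6, n),
    "radius_X": rng.uniform(1.8, 2.2, n),
    "electronegativity_X": rng.uniform(2.5, 3.2, n),
})
df["bandgap"] = (1.0 + 2.0 * (df["electronegativity_X"] - 2.5)
                 + 0.1 * rng.normal(size=n))

descriptor_cols = ["radius_A", "radius_X", "electronegativity_X"]

# Pearson coefficient of linear correlation between each descriptor and the property.
pearson = df[descriptor_cols].corrwith(df["bandgap"])
print(pearson)

# PCA-based 2D projection of the standardized descriptors.
scaled = StandardScaler().fit_transform(df[descriptor_cols])
projected = PCA(n_components=2).fit_transform(scaled)
print(projected[:3])
```

Coloring a scatter plot of `projected` by property value or by chemical species is one simple way to look for clustering in the data.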

The next code cells involve splitting the data randomly into training and test sets, followed by rigorous training of three types of regression models individually for each property: elastic net regression (ENR, representing a linear model), random forest regression (RFR), and kernel ridge regression (KRR). The latter two techniques represent some of the most commonly used nonlinear regression algorithms in materials informatics. The training process for each algorithm involves defining a grid of hyperparameters, optimizing them over the training set based on minimizing the validation error from a fivefold cross-validation technique, and making final predictions for all training and test set datapoints. The subsequent code cell visualizes model performance in the form of parity plots showing ML predictions on the y-axis and actual DFT ground truth on the x-axis, as well as root-mean-square errors (RMSEs) on the training and test sets. It should be noted here that the choice of ENR, RFR, and KRR is arbitrary and serves simply to demonstrate larger concepts, and that the RMSE is only one type of error metric, which could easily be replaced by the mean absolute error or R\(^2\) value. Regression models are trained for the PBE and HSE data sets individually, as well as for the PBE and HSE data sets combined into one: the latter uses the principles of multi-fidelity learning, wherein correlations between properties from the two levels of theory and better predictions for the larger PBE data set are exploited to improve the prediction accuracy for HSE.
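A compact sketch of this train-and-compare step is given below, with one hyperparameter grid per algorithm and fivefold cross-validation. The data are synthetic and the grids are small illustrative choices, not those of the published notebook; the multi-fidelity aspect is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with a nonlinear term (assumption, for illustration only).
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(250, 4))
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.05 * rng.normal(size=250)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

# One estimator and one (small, illustrative) hyperparameter grid per algorithm.
candidates = {
    "ENR": (ElasticNet(max_iter=10000),
            {"alpha": [0.001, 0.01, 0.1], "l1_ratio": [0.2, 0.8]}),
    "RFR": (RandomForestRegressor(random_state=3),
            {"n_estimators": [50, 100], "max_depth": [5, None]}),
    "KRR": (KernelRidge(kernel="rbf"),
            {"alpha": [0.01, 0.1, 1.0], "gamma": [0.1, 1.0]}),
}


def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


results = {}
for name, (estimator, grid) in candidates.items():
    # Hyperparameters are chosen by fivefold cross-validation on the training set.
    search = GridSearchCV(estimator, grid, cv=5,
                          scoring="neg_root_mean_squared_error")
    search.fit(X_tr, y_tr)
    results[name] = {
        "train": rmse(y_tr, search.predict(X_tr)),
        "test": rmse(y_te, search.predict(X_te)),
        "best_params": search.best_params_,
    }
    print(name, results[name])
```

Plotting `search.predict(X_te)` against `y_te` for each model would give the parity plots described above; a large gap between the training and test RMSE is a sign of overfitting.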

The punchline of the notebook comes in the form of how the best predictive models could be used for discovering promising new candidates. This has been shown in two different ways:

  • Enumeration of thousands of new compounds followed by prediction of their properties and screening based on suitable property values (e.g., a negative decomposition energy and a bandgap between 1 and 2.5 eV). A markdown cell in the notebook that follows the ML models presents graphics showing the screening of \(\approx\)18,000 compounds, leading to 392 materials that satisfy multiple objectives of interest.14

  • Performing inverse design by exploiting the DFT-ML surrogate models to iteratively produce new candidates that take you closer to the desired set of properties; this is demonstrated here using a genetic algorithm (GA),29 where starting from a random population of perovskite compositions, new generations are continuously produced, using operations such as elitism, crossover, and mutation, so as to minimize a multi-objective function that ensures chemical feasibility (maintain stoichiometry, fractions of multiple components at A/B/X sites must sum to 1), a negative decomposition energy, bandgap between 1 and 2.5 eV, and a PV efficiency as high as possible. The GA framework includes parameters that can be tuned and models can be run for any number of generations, at the end of which a list of best candidates is obtained.
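The enumeration-and-screening route in the first item above can be sketched in a few lines of pandas. The species labels are generic placeholders, and the "predictions" are random numbers standing in for the output of trained ML models (an assumption made so the example is self-contained); only the filtering logic reflects the screening criteria quoted in the text.

```python
import itertools

import numpy as np
import pandas as pd

# Generic placeholder species at each site (assumption).
A_SITE, B_SITE, X_SITE = ["A1", "A2", "A3"], ["B1", "B2"], ["X1", "X2", "X3"]

# Enumerate every A/B/X combination.
candidates = pd.DataFrame(
    list(itertools.product(A_SITE, B_SITE, X_SITE)), columns=["A", "B", "X"]
)

# Placeholder predictions: random numbers standing in for trained ML models.
rng = np.random.default_rng(4)
candidates["decomp_energy_eV"] = rng.normal(0.0, 0.3, len(candidates))
candidates["bandgap_eV"] = rng.uniform(0.5, 4.0, len(candidates))

# Screening criteria from the text: stable and bandgap between 1 and 2.5 eV.
mask = (candidates["decomp_energy_eV"] < 0) & candidates["bandgap_eV"].between(1.0, 2.5)
shortlist = candidates[mask]
print(f"{len(shortlist)} of {len(candidates)} candidates pass the screen")
print(shortlist)
```

Because the predictions are cheap, the same filter can be applied to many thousands of enumerated compositions, including alloys, once the placeholder columns are replaced with model output.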

Enumeration and screening as well as the GA-based design are performed using both PBE- and HSE-level estimates, with knowledge of the general accuracies of both functionals used for determining the suitable property ranges. Final cells of the notebook present a visualization of the GA results in terms of the objective function plotted against the GA generation as well as the chemical constituents occurring at A/B/X sites in the best performing perovskite compositions. Subsequently, the same perovskite data set is utilized to demonstrate neural network regression as described in the “Neural networks” section. The many concepts explored and explained through this notebook – types of descriptors (compositional versus physical properties), visualization and clustering of data, extracting correlations, splitting data and optimizing regression models, etc. – would eventually serve the materials informatics challenge provided to all attendees at the final session of the tutorial.
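To illustrate the GA operations named above (elitism, crossover, and mutation, with site fractions constrained to sum to 1), the following is a minimal NumPy sketch. It is not the GA used in the published notebook: the linear "surrogate" weights, the penalty form of the objective, and all parameter values are hypothetical choices made so that the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(5)

# Genes are site fractions: 3 A-site, 2 B-site, and 3 X-site species (assumption).
SITES = [slice(0, 3), slice(3, 5), slice(5, 8)]

# Hypothetical linear surrogates standing in for trained ML models.
W_DECOMP = np.array([-0.2, 0.1, 0.0, -0.1, 0.2, -0.1, 0.0, 0.1])
W_GAP = np.array([0.2, 0.3, 0.4, 0.6, 0.3, 0.8, 1.4, 2.2])
W_EFF = np.array([0.05, 0.10, 0.08, 0.10, 0.05, 0.12, 0.06, 0.02])


def normalize(pop):
    """Enforce chemical feasibility: non-negative fractions summing to 1 per site."""
    pop = np.clip(pop, 1e-9, None)
    for s in SITES:
        pop[:, s] /= pop[:, s].sum(axis=1, keepdims=True)
    return pop


def objective(pop):
    """Lower is better: penalize instability and out-of-range gaps, reward efficiency."""
    decomp, gap, eff = pop @ W_DECOMP, pop @ W_GAP, pop @ W_EFF
    penalty = (np.maximum(decomp, 0.0)
               + np.maximum(1.0 - gap, 0.0) + np.maximum(gap - 2.5, 0.0))
    return 10.0 * penalty - eff


def evolve(n_pop=40, n_gen=50, n_elite=4, mutation_scale=0.1):
    pop = normalize(rng.uniform(size=(n_pop, 8)))
    history = []
    for _ in range(n_gen):
        scores = objective(pop)
        order = np.argsort(scores)
        pop, best_score = pop[order], scores[order[0]]
        history.append(best_score)
        elite = pop[:n_elite]                       # elitism: keep the best unchanged
        parents = pop[: n_pop // 2]                 # select from the better half
        pairs = rng.integers(0, len(parents), size=(n_pop - n_elite, 2))
        mask = rng.uniform(size=(n_pop - n_elite, 8)) < 0.5
        children = np.where(mask, parents[pairs[:, 0]], parents[pairs[:, 1]])  # crossover
        children = children + mutation_scale * rng.normal(size=children.shape)  # mutation
        pop = np.vstack([elite, normalize(children)])
    scores = objective(pop)
    history.append(scores.min())
    return pop[np.argmin(scores)], np.array(history)


best, history = evolve()
print("best composition fractions:", np.round(best, 3))
print("objective: first generation", history[0], "-> last", history[-1])
```

Plotting `history` against the generation index gives the kind of convergence curve described above; because the elite candidates are carried over unchanged, the best objective value never gets worse from one generation to the next.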

Neural networks

Building on the foundations laid in the first session of the tutorial, the next reactive code notebook introduces the audience to neural network regression, utilizing the same perovskite data set that the audience is now familiar with. The inputs again are descriptors that encode compositional information and elemental properties, and the output is chosen to be the DFT-predicted bandgap at the PBE level. This notebook uses the TensorFlow30 and Keras31 libraries to help audiences quickly set up and train neural networks with minimal code. The notebook begins by obtaining the raw data set afresh as a CSV file; the data are then converted to a pandas dataframe and “featurized” as in the previous notebook, creating input and output columns. The next code cell carves a train, validation, and test set from the dataframe, reiterating to the audience the importance of these data sets. Following this, the notebook sets up a simple neural network with a few hidden layers, with the instructor encouraging the audience to try out their own variations to this simple architecture. The flexibility of reactive code notebooks allows each audience member to rapidly get comfortable with trying different neural network architectures. After training the model, subsequent code cells visualize the training via learning curves and check for common neural network training pitfalls such as overfitting. Finally, the neural network makes PBE bandgap predictions on test data, encouraging the audience to compare their predictions against previous regression models used in the workshop. A snippet of the neural network code is presented in Figure 5a.
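The same train/validation/test workflow can be sketched without a deep learning framework. The example below deliberately swaps in Scikit-learn's `MLPRegressor` in place of the TensorFlow/Keras model used in the workshop notebook, purely to keep the sketch lightweight; the data are synthetic, and the architecture (two hidden layers of 32 units) is an arbitrary starting point of the kind attendees were encouraged to vary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic descriptors and target (assumption, for illustration only).
rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(600, 5))
y = 1.5 + X[:, 0] * X[:, 1] + np.cos(2 * X[:, 2]) + 0.05 * rng.normal(size=600)

# Carve out test (20%), then validation (20% of the total) from the remainder.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=6)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=6
)

# Fit the scaler on the training set only to avoid information leakage.
scaler = StandardScaler().fit(X_train)

# A small fully connected network with two hidden layers.
net = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                   max_iter=2000, random_state=6)
net.fit(scaler.transform(X_train), y_train)


def rmse(X_part, y_part):
    """RMSE of the trained network on one data split."""
    pred = net.predict(scaler.transform(X_part))
    return float(np.sqrt(np.mean((pred - y_part) ** 2)))


errors = {"train": rmse(X_train, y_train),
          "val": rmse(X_val, y_val),
          "test": rmse(X_test, y_test)}
print(errors)
print("training loss recorded for", len(net.loss_curve_), "iterations")
```

A validation error much larger than the training error would indicate overfitting; `net.loss_curve_` holds the training loss per iteration and can be plotted as a training curve.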

Following the regression demonstration, we slightly change tracks and introduce the audience to a classification problem that is solved using neural networks. This problem uses images as inputs, and is aimed at materials scientists working with microstructural data such as electron/optical microscopy images or diffraction data, introducing them to the idea of learning from image data using neural networks. To demonstrate image classification using neural networks, the reactive code notebook uses the FashionMNIST32 data set, which consists of thousands of clothing images, and attempts to classify them into their correct categories (T-shirts, trousers, boots, etc.). Because this data set is new to the audience, the instructor spends some time introducing and pre-processing the data, making sure the inputs and outputs are clear. Then, the concept of “convolution operations” to learn features from images is illustrated, before showing the code for a neural network composed of convolutional layers. Once again, audience members are encouraged to try out their own variations of the network architecture by modifying parameters. As with the regression tutorial, the model is trained, after which the notebook visualizes learning curves and makes predictions on the test data set.
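The idea behind a convolution operation can be shown with a few lines of NumPy. In the sketch below, a small kernel slides over a toy image and responds strongly where the image contains a vertical edge. The kernel is set by hand here (an assumption for illustration), whereas in a convolutional neural network the kernel values are learned during training; as in most deep learning libraries, the kernel is not flipped, so the operation is strictly a cross-correlation.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy 6x6 "image": dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-set 3x3 kernel that responds to left-to-right increases in intensity.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
# Each row of the 4x4 output is [0, 3, 3, 0]: nonzero only near the edge.
```

Stacking many such kernels, each followed by a nonlinearity, is what allows convolutional layers to build up features from images such as micrographs.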

These Jupyter notebooks can be found on nanoHUB,24 along with other basic machine learning tutorials.33

Figure 5

(a) Snippet of code from the neural network notebook. (b) Snippet of code from the active learning notebook. (c) Example plot showing the success of determining an unknown function by sampling new points using active learning.

Gaussian processes and active learning

A separate reactive code notebook was used to introduce the audience to Gaussian processes (GPs), a Bayesian technique.23 The goal of Bayesian methods such as GPs is to robustly determine the following: given the data so far and some knowledge of the problem, what can be concluded about new locations in the domain? GPs are particularly well suited to smaller data sets, such as those encountered in experimental measurements or computationally intensive simulations. GPs natively output uncertainty information with the predictions, which makes them natural choices for “active learning” campaigns, where the objective is to iteratively add new datapoints to minimize overall uncertainty and maximize reward. This Jupyter notebook tutorial aims to provide intuitions about the implications of how the GPs are used, so that they can be applied in the next reactive code notebook on active learning (AL) for autonomous experimental materials science.

The notebook presents GPs as a collection of related probability distributions over the input domain. Knowledge about the problem (the priors in Bayesian inference parlance) is specified by the design of the mean and kernel (also called covariance) functions. The mean function depends on the global position in the domain (e.g., a function such as \(f(x) = x^2\)). In contrast, the kernel functions only depend on the relative similarity between locations in the input domain. The next cells present the implications of choosing some popular kernels. For example, using the radial basis function kernel implies that the user expects the function to be smooth (i.e., infinitely differentiable). In contrast, choosing the Matérn \(\frac{3}{2}\) kernel implies a rougher function that is only once differentiable, so that abrupt changes in its slope are expected. The notebook concludes with an illustration of how GPs can be used for regression. In addition to executing code alongside the instructor, the workshop attendees were encouraged to use the notebook for generating example data and training regression models with different choices of kernels and hyperparameters. Further examples are provided showing common pitfalls and demonstrating how to propagate the measurement uncertainty.
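The effect of the kernel choice can be explored with a short Scikit-learn sketch such as the one below, which fits GPs with a radial basis function kernel and with a Matérn kernel (\(\nu = \frac{3}{2}\)) to the same small data set. The sine-shaped data, the noise level, and the initial kernel settings are assumptions for illustration and are not taken from the workshop notebook.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel

# A small, noisy data set (assumption): 12 measurements of sin(x) on [0, 10].
rng = np.random.default_rng(7)
X_train = np.sort(rng.uniform(0, 10, size=12)).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + 0.05 * rng.normal(size=12)
X_grid = np.linspace(0, 10, 200).reshape(-1, 1)

# Two priors on smoothness; WhiteKernel models the measurement noise.
kernels = {
    "RBF": RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01),
    "Matern-3/2": Matern(length_scale=1.0, nu=1.5) + WhiteKernel(noise_level=0.01),
}

predictions = {}
for name, kernel in kernels.items():
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=7)
    gp.fit(X_train, y_train)  # kernel hyperparameters are optimized during fit
    mean, std = gp.predict(X_grid, return_std=True)
    predictions[name] = (mean, std)
    print(f"{name}: fitted kernel = {gp.kernel_}")
    print(f"{name}: largest predictive std on grid = {std.max():.3f}")
```

Plotting `mean` with a band of plus or minus two `std` for each kernel shows how the uncertainty grows away from the measured points and how the assumed smoothness shapes the interpolation between them.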

In the next section, a second notebook builds upon the GP background to demonstrate AL with a simulated autonomous experiment campaign. This notebook covers the basic concepts of several common decision-making policies that use both the predictions of the function of interest and the uncertainties in those predictions. Presenting this material as a notebook allows the attendees to run the simulated AL campaigns. They were encouraged to modify the AL policy and the model design, run the campaigns, and compare and contrast the obtained results. This notebook concludes by presenting several performance metrics for the success of the AL campaigns. These metrics evaluate the success of the AL campaigns toward the goals of either learning a global map of the function or locating its global optimum. A code snippet from the AL notebook and a plot showing results at a particular step of the AL campaign are presented in Figure 5b–c, respectively.
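A simulated AL campaign of this kind can be sketched as a simple loop: fit a GP to the points measured so far, score every unmeasured candidate with a decision policy, "measure" the highest-scoring candidate, and repeat. The sketch below is an illustration under stated assumptions, not the workshop notebook: the hidden function, the two policies (pure exploration by maximum uncertainty, and an upper-confidence-bound rule with a hypothetical weight `beta`), and the two metrics are all choices made for this example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def hidden_function(x):
    """Stand-in for an experiment or simulation (assumption)."""
    return np.sin(x) + 0.5 * np.cos(3 * x)


# Discrete pool of candidate "experiments" and their (normally unknown) outcomes.
X_pool = np.linspace(0, 10, 101).reshape(-1, 1)
y_pool = hidden_function(X_pool).ravel()

rng = np.random.default_rng(8)
initial_idx = [int(i) for i in rng.choice(len(X_pool), size=3, replace=False)]


def fit_gp(idx):
    """Fit a GP to the candidates measured so far."""
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-4)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    return gp.fit(X_pool[idx], y_pool[idx])


def run_campaign(policy, n_steps=15, beta=2.0):
    idx = list(initial_idx)
    for _ in range(n_steps):
        mean, std = fit_gp(idx).predict(X_pool, return_std=True)
        # "explore": most uncertain point; otherwise upper confidence bound.
        score = std.copy() if policy == "explore" else mean + beta * std
        score[idx] = -np.inf               # never repeat a measurement
        idx.append(int(np.argmax(score)))
    final_mean = fit_gp(idx).predict(X_pool)
    map_rmse = float(np.sqrt(np.mean((final_mean - y_pool) ** 2)))  # global-map metric
    best_found = float(y_pool[idx].max())                           # optimum metric
    return idx, map_rmse, best_found


campaigns = {policy: run_campaign(policy) for policy in ("explore", "ucb")}
for policy, (idx, map_rmse, best_found) in campaigns.items():
    print(f"{policy}: map RMSE = {map_rmse:.3f}, best value found = {best_found:.3f} "
          f"(true maximum {y_pool.max():.3f})")
```

Changing the policy, the kernel, or `beta` and rerunning the campaigns is the kind of experiment attendees were encouraged to try; the two metrics correspond to the two goals described above.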

Raising curiosity and confidence via competitions

Organizing an in-class competition after the tutorials provides participants with hands-on experience in applying the ML and data handling techniques introduced during the workshop, without answers provided. This learning format encourages participants to follow their curiosity and explore different ML methods, serving as a bridge to applying similar methods to their own research challenges. There are several advantages to hosting a data science competition along with workshops. First, it provides participants with a platform to receive timely feedback and benchmark their answers against those of others. Having completed a materials data challenge independently also boosts newcomers’ confidence in pursuing further research in materials informatics. Launching the competition during or immediately after the tutorials ensures that the material is still fresh in participants’ minds, and an in-person Q&A session is particularly friendly for beginners, who may have questions about, for example, installing a Python package or resolving data-download bottlenecks, which are easier to address when an instructor can take over the computer screen.

Both competitions we ran at the MRS Fall Meeting Tutorial Days in 2021 and 2022 consisted of an introductory session explaining the data challenge’s significance in materials science, followed by sharing the data and a Google Colab notebook to help participants get started. In the 2021 workshop, we simulated a data set and asked participants to predict y values (materials property) from x values (composition) by applying supervised ML and AL techniques. In the 2022 workshop, the competition was hosted in partnership with the Toyota Research Institute and the University of Maryland. Participants were given an industry-relevant battery degradation challenge and asked to predict battery cycle life using an experimental battery testing data set from the literature.34 Through this partnership, MRS tutorial participants had the opportunity to be invited as finalists for the Battery Informatics & ML Hackathon, which was open to a wider audience, including battery researchers and ML professionals. Because the submission platform and live leaderboard were open to all, participants from diverse academic and research backgrounds could benchmark their solutions against each other. To achieve the educational goals of attracting more participants from diverse fields while ensuring the fairness of the competition, two separate Judges’ awards were set up to reward the best models received from MRS tutorial attendees and community participants, respectively.

Outlook and next steps

Informatics-based approaches are the norm in all materials design endeavors today, and the importance of educating materials scientists in essential concepts and practical aspects of AI/ML cannot be overstated. There are numerous virtual and in-person workshops being regularly organized by institutions and professional societies the world over to address gaps in materials informatics training. Furthermore, such topics have made their way into existing and new courses in many materials engineering departments, as it becomes imperative that students learn and apply data science methodologies to their research. Through this article, we attempted to shed light on why, how, and what we should be teaching beginners to put them on the path to becoming seasoned AI/ML practitioners. MRS conferences represent the foremost gathering of materials researchers from across the world, and there is no better venue to host an introductory workshop like what is described in detail in this article. This workshop is planned for many more upcoming meetings, always with a rolling list of instructors, and necessary updates and additions as the state of the art evolves and new techniques and capabilities come to light. Once again, we will present this tutorial at the 2023 MRS Spring Meeting and beyond as we continue to help normalize and popularize materials informatics education and training.