Several authors have proposed the application of machine learning techniques to solve problems in geochemistry. Table 1 summarizes these research works.
In the following research works, the authors compared several supervised learning approaches to create optimal classifiers and classify new samples.
In Itano et al. (2020), the authors proposed the use of Multinomial Logistic Regression to discriminate the source rock of detrital monazites. They used 16 elements (La, Cr, Pr, Nd, etc.) from samples of detrital monazites from African rivers. Models were built for every non-empty combination of the 16 elements (65,535 different combinations) to find those with the best discrimination. Accuracy values were compared by number of elements, and the results showed that the highest accuracy (97%) is obtained with 8 to 10 elements.
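The figure of 65,535 combinations follows from counting every non-empty subset of 16 elements, that is, 2^16 - 1. The sketch below checks this count and shows how such subsets could be enumerated. It is only an illustration: the names `N_ELEMENTS`, `subsets_by_size`, and the placeholder element labels `E1` to `E4` are assumptions introduced here, not taken from Itano et al. (2020).

```python
from itertools import combinations
from math import comb

N_ELEMENTS = 16  # number of elements reported in the study

# Count the non-empty subsets size by size: C(16, k) for k = 1..16.
subsets_by_size = {k: comb(N_ELEMENTS, k) for k in range(1, N_ELEMENTS + 1)}
total_subsets = sum(subsets_by_size.values())

# The same total via the closed form 2^n - 1 (all subsets minus the empty one).
assert total_subsets == 2 ** N_ELEMENTS - 1

# Enumerating the subsets themselves, shown on 4 placeholder labels (hypothetical).
demo_elements = ["E1", "E2", "E3", "E4"]
demo_subsets = [
    subset
    for k in range(1, len(demo_elements) + 1)
    for subset in combinations(demo_elements, k)
]

print(total_subsets)      # 65535
print(len(demo_subsets))  # 15
```

Each enumerated subset would correspond to one candidate feature set for which a model is trained and scored.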
In Maxwell et al. (2019), the authors applied 3 classification algorithms (Random Forest, Gradient Boosted Machine, and Deep Neural Network) to predict altered and non-altered lithotypes. They used a dataset with geophysical log data from 1,230 coal samples taken from 263 boreholes from the Leichhardt Seam of the Bowen Basin in Eastern Australia. The dataset was randomly split into an 80% training set and a 20% testing set. The Random Forest model performed best, with average results of 99% precision, 99% recall, and 99% F-score on the training set, and 97% precision, 93% recall, and 95% F-score on the testing set. Only 11 of the 1,230 samples were misclassified.
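The reported F-scores are consistent with the reported precision and recall if the F-score is taken to be the balanced F1 score, the harmonic mean of the two. That reading is an assumption made here, since the summary above does not state which F-score variant was used. The function name `f_score` is also introduced only for this illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above for the Random Forest model.
train_f = f_score(0.99, 0.99)
test_f = f_score(0.97, 0.93)

print(round(train_f, 2))  # 0.99
print(round(test_f, 2))   # 0.95
```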
In Hasterok et al. (2019), the authors compared and discarded different approaches (Discriminant Analysis, Logistic Regression Analysis, Decision Trees, etc.) to develop an accurate protolith classifier. A normalized dataset was created by extracting 9 major elements (SiO2, TiO2, Al2O3, MgO, etc.) from 533,360 samples: 497,401 igneous samples and 35,959 sedimentary samples. The samples were taken from a global dataset of rock major elements. The results showed that the best classifier was an Ensemble Trees model (RUSBoost), which correctly classified 95% of the igneous samples and 85% of the sedimentary samples.
In Ueki et al. (2018), the authors compared 3 classification algorithms (Support Vector Machine, Random Forest, and Sparse Multinomial Regression) for the discrimination of volcanic rocks according to 8 tectonic settings. The dataset was obtained from 2 global geochemical databases: PetDB and GEOROC. It was composed of 24 geochemical variables and 5 isotopic ratios (29 features) and contained 2,074 samples. The results showed that the 3 methods achieved an accuracy higher than 83% for most of the classes. Although the accuracy of Sparse Multinomial Regression was the lowest, it was the most useful method for generating geochemical signatures that were easy to interpret and analyze.
In Petrelli and Perugini (2016), the authors used Support Vector Machines to classify rock samples according to 8 different tectonic settings. The dataset was composed of major elements, trace elements, and isotopes from 3,095 samples. They classified the samples using major elements, trace elements, and isotopes separately, and then using all three combined (4 experiments in total). The results showed that the combination of major elements, trace elements, and isotopic data gives a higher accuracy (93%) than any of them separately: 79% for major elements, 87% for trace elements, and 79% for isotopes.
In the following research works, unsupervised learning was used to find patterns in geochemical data. They also show the relationship between data reduction and clustering.
In Ellefsen and Smith (2016), the authors applied a hierarchical clustering method to interpret geochemical data from the soil of Colorado, USA. The dataset was cleaned based on the concentration percentage of the elements and then transformed with PCA. The final dataset contained 959 samples with 22 principal components. The hierarchical method produced 2 clusters, each characterized by a set of associated elements. Cluster 1 contained elements commonly enriched in shales and other fine-grained marine sedimentary rocks. Cluster 2 contained elements commonly associated with potassium feldspars or felsic rocks. The plotted results were consistent with the geological units in the area.
In Jiang et al. (2015), the authors applied PCA and Hierarchical Cluster Analysis (HCA) to study the geochemical processes that control the presence of arsenic (As) in groundwater in the Hetao Basin, Inner Mongolia, China. A total of 90 groundwater samples with 22 geochemical parameters (Ca, Cl, Na, NO3, pH, etc.) were collected from the area. PCA was applied to the samples, identifying 4 major principal components that explain 78.2% of the variance of the original data. These components were the input for the HCA method. The results showed 3 clusters. In Cluster 1, high As concentrations correspond to high P concentrations in the flat plain. In Cluster 2, samples are affected by lithological and redox factors. In Cluster 3, low As concentrations correspond to low P concentrations in alluvial fans.
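The second stage of such a workflow, clustering samples by their principal component scores, can be sketched with a minimal agglomerative procedure. The sketch rests on several assumptions: average linkage with Euclidean distance is chosen only for illustration (the linkage used by Jiang et al. (2015) is not stated above), the PCA step is omitted, and the `scores` values and the function names `average_linkage` and `agglomerative` are hypothetical.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # combinations() yields index pairs with a < b, so deleting b is safe.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical sample scores on two principal components (not real data).
scores = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (10.0, 0.0), (10.1, 0.3)]
clusters = agglomerative(scores, n_clusters=3)
print(clusters)  # [[0, 1], [2, 3], [4, 5]]
```

Clustering on a few component scores rather than on all original parameters reduces the dimensionality the distance computation has to handle, which is the link between data reduction and clustering noted above.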
In Alférez et al. (2015), the authors compared PCA and Geographic Information Systems (GIS) techniques with the K-Means clustering algorithm. The authors used geochemical data from 800 rock samples from an area of Southern California. The approaches were compared in terms of 4 geochemical factors: SiO2, Sri, Gd/Yb, and K2O/SiO2. The results showed that the K-Means algorithm gives results very similar to the ones obtained with GIS and PCA.
According to the research works presented above, supervised learning is more widely used than unsupervised learning in geochemistry. In general, research works on supervised learning use the following methodology: cleaning and selecting the data, splitting the dataset, training the algorithms, and reporting the results. However, several of these research works omit specific data preparation steps, such as imputation, removing null values, or sample balancing. They also tend to evaluate and compare the models by accuracy alone, leaving out other metrics. No work proposes a machine learning platform or an AutoML tool for geochemical analysis.
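The preparation and evaluation steps mentioned above can be illustrated with a minimal sketch of mean imputation, random undersampling for class balancing, and precision, recall, and F1 as metrics beyond accuracy. Everything in it is an assumption made for illustration: the function names, the two-feature `rows`, the labels, and the predictions are hypothetical and do not come from any of the reviewed works.

```python
import random
from statistics import mean


def impute_mean(rows):
    """Replace None in each column with the mean of that column's observed values."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


def undersample(rows, labels, seed=0):
    """Randomly drop samples so every class is as large as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = [
        (row, label)
        for label, group in by_class.items()
        for row in rng.sample(group, n_min)
    ]
    rng.shuffle(balanced)
    return [row for row, _ in balanced], [label for _, label in balanced]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical samples with two features and missing values (not real data).
rows = [[70.0, 1.0], [None, 1.5], [68.0, None], [72.0, 0.5], [50.0, 8.0], [48.0, 9.0]]
labels = ["A", "A", "A", "A", "B", "B"]

filled = impute_mean(rows)
balanced_rows, balanced_labels = undersample(filled, labels)

# Hypothetical predictions: one minority-class sample is missed.
y_pred = ["A", "A", "A", "A", "A", "B"]
accuracy = sum(t == p for t, p in zip(labels, y_pred)) / len(labels)
precision, recall, f1 = precision_recall_f1(labels, y_pred, positive="B")

print(round(accuracy, 2))             # 0.83
print(precision, recall, round(f1, 2))  # 1.0 0.5 0.67
```

In this toy case the accuracy looks high while recall for the minority class is only 0.5, which is the kind of gap that accuracy alone hides. In practice, imputation statistics would normally be computed on the training split only, to avoid leaking information from the test set.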