1 Introduction

The global community is presently facing considerable challenges in obtaining natural resources (e.g., minerals, energy, water, and food) while risking unwanted environmental effects of extreme events, such as global warming, loss of biodiversity, and natural/anthropogenic hazards (Sorkhabi 2022). In the face of these challenges, the sustainable development of our world will require a deeper understanding of how the Earth operates in order to better tackle or predict extreme events, which must be better understood in terms of both the present and the deep-time Earth (Cheng 2022). Progress in science and technology is a primary driving force in societal development, as it helps provide vital solutions to these challenges. In the past decade, big data and artificial intelligence (AI) have significantly altered the lifestyles and overall prosperity of society, while also shaping a fourth scientific paradigm: data-intensive (or data-driven) science. This fourth paradigm follows the traditional paradigms of experimental, theoretical, and computational sciences (Bell et al. 2009; Hey et al. 2009). Driven by ever-expanding arrays of data and armed with digital technologies (e.g., AI, big data analysis, and supercomputing), geoscientists are advancing traditional ways of thinking in both industry and academia (Bergen et al. 2019; Sun et al. 2022).

Historically, geosciences have progressed using inductive, knowledge-driven (or theory-guided) models by first generating hypotheses and then collecting evidence to prove or disprove them (Agterberg 2020). In geosciences, knowledge-driven models rely on logical reasoning based on prior knowledge gained by geologists, such as plate tectonics, evolutionary theory, and mineral deposit models. However, constructing prior geoscience knowledge is constrained by the paucity of (preserved or exposed) rocks and by limited observations, which hinder inference and knowledge discovery. By contrast, data-driven science, based on abduction with big data, offers an opportunity for discovering new knowledge through AI techniques (e.g., machine learning and knowledge graphs) without a specific hypothesis (or theory) in mind. The advantages of data-driven discovery include extending human learning alone into an integration of human learning and machine learning, as well as providing answers to known questions and uncovering unknown answers to unknown questions (Cheng and Zhao 2020).

In general, data-driven science consists of several basic activities, including data capture, data curation, and data analysis (Hey et al. 2009). Data capture is the basis of data-intensive science. Over recent decades, the rapid development of remote and in situ sensing techniques and the subsequent swift deployment of these technologies have led to the explosive growth of geoscience data for both industry and academia. Simultaneously, researchers have accumulated vast amounts of engineering and scientific data; these legacy data may be likened to gold deposits yet to be mined. Data curation, including data cleaning, data alignment, and metadata conversion, aims to build a data life cycle and form the data basis for conducting data-driven discovery via AI techniques. Considerable efforts have already been dedicated to quantifying geosciences. Examples include Macrostrat and EarthChem, along with many other data portals and databases. Data analysis and mining are key features of the fourth paradigm of scientific research. These types of approaches are imperative in tackling the data deluge through cutting-edge AI techniques, including machine learning, deep learning, and knowledge graphs.

Although the geoscience community has been slow in adopting AI and big data techniques (relative to other disciplines), data-driven discovery is gaining popularity among industry, government, and academia. This is reflected in the ever-increasing number of programs initiated, conferences and workshops convened, and related research papers published. For example, the United States Geological Survey (USGS) proposed four innovation areas for its future work, including big data, critical mineral resources, ecological resources, and natural hazards. Furthermore, the USGS suggested several potential directions for developing data-driven geosciences (Bristol et al. 2012). In 2019, the Deep-Time Digital Earth (DDE) big science program was initiated by the International Union of Geological Sciences (IUGS) with the intention of reconstructing the co-evolution of life, geography, matter, and climate over the 4.6 billion years of Earth history using data-driven abductive discovery, and of identifying the spatiotemporal distribution of global mineral and energy resources via AI techniques (Cheng and Zhao 2020; Wang et al. 2021).

Over the past decade, the data-driven discovery paradigm has received considerable attention for solving a variety of geoscience questions and challenges (Sun et al. 2022). These range from fundamental questions of Earth science to technical bottlenecks in engineering, such as the exploration of mineral and oil/gas resources. As a practical application, data-driven techniques, such as machine learning and deep learning, have played crucial roles in improving data processing methods and approaches to interpreting data in the fields of remote sensing (Zhu et al. 2017), applied geophysics (Wang and Chen 2021; Yu and Ma 2021), and mineral prospectivity modeling (Chen et al. 2022c; Cheng and Agterberg 1999). In addition to transforming industries, data-driven science is also beginning to play an important role in advancing scientific discoveries of complex Earth systems, such as earthquake forecasting (Mousavi and Beroza 2022), global climate change (Reichstein et al. 2019), planetary interior structure (Wilding et al. 2022), the evolution of mass, life, and climate in the early Earth (e.g., Chen et al. 2022b; Chiaradia 2014; Fan et al. 2020; Hazen 2014; Keller and Schoene 2012; Puetz et al. 2018), and the search for extraterrestrial life (Ma et al. 2023). As yet another example, a high-resolution history of Earth’s atmospheric oxygenation over the past 4 billion years was reconstructed using machine learning and big data from mafic igneous rocks (Chen et al. 2022a).

The special collection on “Data-driven Discovery in Geosciences” gathers six research papers that showcase new developments and novel applications of data-driven AI techniques in multiple aspects of geosciences. In Sect. 2, we summarize the highlights of these papers in addressing the specific challenges in different domains when using data-driven AI techniques. In Sect. 3, we outline future challenges for facilitating data-driven Earth science and then speculate about possible directions for these advances.

2 Summary of Articles in This Special Issue

The paper entitled “Geographically Optimal Similarity” by Song (2022) develops a mathematical model of geographically optimal similarity (GOS) for accurate and reliable spatial prediction of geological variables (e.g., trace elements) based on the Third Law of Geography, namely, the geographical similarity principle, which describes the comprehensive degree of approximation of a geographical structure rather than explicit relationships between variables. GOS employs a small number of samples and derives better spatial predictions than traditional methods. An R package named “geosimilarity” was developed for GOS-based predictions and uncertainty assessments. This work demonstrates the potential for applying the GOS model to spatial predictions, such as geochemical mapping in environmental assessments and mineral exploration.
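The similarity principle can be illustrated with a minimal sketch. This is not the implementation in the geosimilarity package. It assumes a Gaussian similarity for each covariate, takes the minimum across covariates as the overall similarity between two locations, and predicts with a similarity-weighted mean of the most similar samples. The function names, the kernel, and the keep_fraction parameter are hypothetical choices made for illustration only.

```python
import math
import statistics


def similarity(target, sample, scales):
    """Overall similarity between two locations (assumed form).

    Each covariate contributes a Gaussian similarity in (0, 1]; the overall
    similarity is the minimum, so one dissimilar covariate is limiting.
    """
    return min(
        math.exp(-((t - s) ** 2) / (2.0 * scale ** 2))
        for t, s, scale in zip(target, sample, scales)
    )


def predict(samples, values, target, keep_fraction=0.5):
    """Similarity-weighted prediction at one target location.

    samples: list of covariate tuples at sampled locations
    values: observed values of the variable at those locations
    target: covariate tuple at the location to predict
    keep_fraction: share of the most similar samples to use (hypothetical
        tuning parameter; in practice it would be chosen by cross-validation)
    """
    # Scale each covariate by its standard deviation (fall back to 1.0 if constant).
    scales = [statistics.pstdev(column) or 1.0 for column in zip(*samples)]
    sims = [similarity(target, sample, scales) for sample in samples]
    # Keep only the most similar samples.
    k = max(1, round(keep_fraction * len(samples)))
    top = sorted(zip(sims, values), reverse=True)[:k]
    total = sum(weight for weight, _ in top)
    return sum(weight * value for weight, value in top) / total
```

In this sketch, samples whose covariates differ strongly from those at the target location are excluded from the average, which is why a small number of well-matched samples can be sufficient.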

The paper entitled “Revealing Geochemical Patterns Associated with Mineralization Using t-Distributed Stochastic Neighbor Embedding and Random Forest” by Shi et al. (2022) focuses on mineral prospectivity modeling using both unsupervised and supervised learning algorithms. A hybrid model combining t-distributed stochastic neighbor embedding (t-SNE) and the random forest (RF) method addresses data redundancy and the curse of dimensionality in geochemical mapping for mineral exploration. The application to the exploration of gold deposits in the northwestern Hubei Province of China demonstrates that the hybrid model combining t-SNE and RF can identify geochemical anomalies associated with gold mineralization efficiently. The high agreement with known gold deposits suggests that the areas targeted by t-SNE + RF can guide future mineral exploration in this area of study.

The paper entitled “Robust Optimal Well Control using an Adaptive Multigrid Reinforcement Learning Framework” by Dixit and Elsheikh (2022) focuses on optimal control problems using cutting-edge data-driven deep learning techniques. An adaptive multigrid reinforcement learning (RL) framework was introduced to address the computational challenge of robust control policies for uncertain, partially observable well control attributes. RL-based control policies are initially learned using computationally efficient low-fidelity simulations with coarse grid discretization of the underlying partial differential equations. The proposed RL framework was demonstrated by using the state-of-the-art Proximal Policy Optimization algorithm. Its application to two cases of well control problems suggests significant gains in computational efficiency. The improved efficiency is estimated to be between 60 and 70% when compared to single fine-grid methods.

The paper entitled “Ensemble and Self-supervised Learning for Improved Classification of Seismic Signals from the Åknes Rockslope” by Lee et al. (2022) focuses on geohazard monitoring using data-driven deep learning techniques. The fast and reliable identification and classification of seismic events provide crucial information for monitoring rock slopes and for early warning systems for potential rock slides. In this paper, a classifier for seismic geophone data was built to distinguish between different types of microseismic events using deep convolutional neural networks. With ensemble learning, the classification accuracy was improved in comparison to aggregation into a single spectrogram. This work also demonstrates the value of applying self-supervised learning, which is particularly relevant for datasets with insufficient labeling.
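As a generic illustration of ensemble prediction, the sketch below averages the class probabilities produced by several classifiers and selects the class with the highest mean probability (soft voting). It is not the method of Lee et al. (2022). The function name soft_vote and the example probabilities are hypothetical, and the assumption that each probability vector comes from a separate classifier is made only for illustration.

```python
def soft_vote(prob_sets):
    """Average class probabilities over ensemble members.

    prob_sets: list of probability vectors, one per ensemble member,
        all of the same length (one entry per class).
    Returns (predicted_class_index, mean_probabilities).
    """
    n_members = len(prob_sets)
    n_classes = len(prob_sets[0])
    mean = [
        sum(probs[c] for probs in prob_sets) / n_members
        for c in range(n_classes)
    ]
    # The ensemble prediction is the class with the highest mean probability.
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean


# Hypothetical outputs of three classifiers for one event and three classes.
member_probs = [
    [0.6, 0.4, 0.0],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
label, mean = soft_vote(member_probs)
```

Here the first member alone would choose class 0, whereas the averaged probabilities favor class 1, which shows how an ensemble can override a single uncertain prediction.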

The paper entitled “Random Noise Attenuation by Self-supervised Learning from Single Seismic Data” by Wang et al. (2022) focuses on reflection seismic data denoising using deep learning algorithms in the field of oil/gas exploration. A dropout-based self-supervised (DSS) deep learning method was introduced for single seismic data random noise attenuation to address the challenges arising from limited clean (i.e., noise-free) labels when using supervised algorithms in practice. Compared to the traditional f–x deconvolution and deep image prior methods, the DSS method achieves better denoising results while preserving details of synthetic seismic data and field data. Moreover, numerical experiments indicate that the DSS method is stable for seismic denoising and reduces over-fitting.

The paper entitled “Construction and Application of a Knowledge Graph for Iron Deposits Using Text Mining Analytics and Deep Learning Algorithm” by Qiu et al. (2023) explores one of the frontiers for applying AI techniques in geoscience, that is, building a knowledge graph for facilitating knowledge discovery. A deep learning model was introduced to automate the extraction of geological entity relations from ore deposits, while creating a prototype question–answer system (Q&A) for ore-forming circumstances. This approach establishes annotation specifications for iron ore deposit entity relationships and a human-annotated corpus of geological entities of iron ore deposits. The constructed geological knowledge graphs were applied to analyze the mineralization characteristics of the Daye iron deposits in China.

3 Outlook

This special collection showcases a variety of data-driven research and/or applications in geosciences including seismic data processing, mineral prospectivity modeling, environmental pollution assessments, and geohazard monitoring, by using many data-driven AI techniques, such as unsupervised, supervised, self-supervised, and reinforcement learning. While recent advances in big data and AI approaches offer wonderful new opportunities for accelerating scientific discoveries and predictions via abductive, data-driven models and techniques, we face unique challenges specific to the geoscience domain, in addition to common difficulties pertaining to data capture, storage, searching, sharing, and visualization. The first challenge arises from transforming complex geoscience data into usable information because of the heterogeneity of the multivariate data as well as complex patterns obscured in data. The second challenge stems from converting information into knowledge due to gaps between predictions and the current understanding. Geoscience big data can be a gold mine. Whether this gold mine can be discovered by geoscientists depends on how effectively we overcome these challenges. Simply put, tackling the above challenges calls for domain-specific mathematical (statistical) models, advanced machine learning algorithms capable of learning with limited, weak, or biased labels, as well as a combination of data-driven and knowledge-driven models (Karniadakis et al. 2021). Given that many AI techniques are deeply rooted in mathematical and computational models (De Iaco et al. 2022; Dramsch 2020), it is an important mission for mathematical geoscientists to seize the strategic opportunity of the ongoing data revolution and bridge the gap between geoscientific data and AI models to further promote the new paradigm of data-driven Earth science research. 

Overall, while the research and development of data-driven discovery in geosciences are still in their infancy, we envision this new scientific paradigm playing ever-greater roles in the future. Such roles include but are not limited to protecting our society from various geohazards (e.g., major earthquakes, explosive volcanic eruptions, and landslides), providing resources for future generations, tackling environmental degradation and climate change, and searching for habitable planets.