Seismic velocity modeling in the digital transformation era: a review of the role of machine learning

Seismic velocity modeling is a crucial step in seismic processing that enables the use of velocity information from both seismic and wells to map the depth and thickness of subsurface layers interpreted from seismic images. The velocity can be obtained in the form of normal moveout (NMO) velocity or by an inversion (optimization) process such as in full-waveform inversion (FWI). These methods have several limitations. These limitations include enormous time consumption in the case of NMO due to manual and heavy human involvement in the picking. As an optimization problem, it incurs high cost and suffers from nonlinearity issues. Researchers have proposed various machine learning (ML) techniques including unsupervised, supervised, and semi-supervised learning methods to model the velocity more efficiently. The focus of the studies is mostly to automate the NMO velocity picking, improve the convergence in FWI, and apply FWI using ML directly from the data. In the purview of the digital transformation roadmap of the petroleum industry, this paper presents a chronologic review of these studies, appraises the progress made so far, and concludes with a set of recommendations to overcome the prevailing challenges through the implementation of more advanced ML methodologies. We hope that this work will benefit experts, young professionals, and ML enthusiasts to help push forward their research efforts to achieving complete automation of the NMO velocity and further enhancing the performance of ML applications used in the FWI framework.


Introduction
The velocity of a medium is a crucial step to obtain an accurate seismic image. Inaccurate velocity typically leads to incorrect positioning of the reflectors and hence causes a discrepancy in the image. Many techniques have been developed over the years to ensure a good velocity estimate of the subsurface. A common practice in seismic processing is to obtain the normal moveout (NMO) velocity from the common midpoint (CMP) gathers. The NMO velocity flattens the hyperbola in the data. Velocity search by semblance analysis (Yilmaz 1987) is the most common method used to estimate NMO velocity. Since it depends on manually picking the maximum energy semblance of the stacking velocity, this technique requires extensive human intervention. Hence, it is time-consuming, especially for large 3D volumes. Besides that, NMO velocity is often not accurate as it is based on the assumption of lateral homogeneity and fails to estimate complex structures. It provides a smooth velocity courtesy of the interpolation between the CMP gathers.
Velocity inversion is often implemented to obtain a more accurate velocity model. Full-waveform inversion (Tarantola 1984) is a common inversion method that provides a high-resolution velocity model by minimizing the least square misfit between observed data from the field and the modeled version. It is an ill-posed nonlinear optimization problem. Starting from an initial velocity model such as an NMO velocity, FWI updates the velocity iteratively until it converges to hopefully global minimum. However, due to the lack of low frequencies in the data, it often converges to a local minimum, especially with a poor starting model. Regularizing the inversion with constraints and prior information has been proven to achieve better convergence (Asnaashari et al. 2013;Kalita et al. 2019).
Recently and due to the advancements in computational resources and the availability of state-of-the-art algorithms, there has been a wide interest in machine learning (ML) applications within the geoscience community. These applications include formation tops identification using unsupervised learning techniques (Xuan and Murphy 2007) and seismic facies analysis using artificial neural network (ANN) (Wrona et al. 2018) and formation tops identification using supervised learning techniques (Maniar et al. 2018). Others are fault detection (Xiong et al. 2018;Wu et al. 2019) and salt interpretation (Zeng et al. 2019) using convolutional neural networks (CNN), and seismic attributes selection using ANN (Qi et al. 2020). Additional contributions were made to overcome the limitations of the current velocity building techniques. Most of these efforts were aimed at automating the picking process in the semblance related approaches while assisting FWI by low-frequency extrapolation, gradient manipulation, and regularization by using ML to converge faster. Some other attempts have been made to estimate the NMO or FWI velocity directly from the prestack shots/CMP gathers.
In this paper, we track and review the progressive efforts in the development of velocity models from the traditional empirical and analytical approaches to the use of machine learning techniques. First, we present an overview of some of the ML techniques that were applied in the studies that we reviewed in this paper. These include unsupervised learning technique such as clustering, supervised learning techniques such as deep neural networks, and semi-supervised learning technique such as meta-learning. Then, we discuss the ML applications in estimating NMO velocity. After that, we review the ML applications in the more advanced velocity modeling method, FWI. Finally, we conclude by making recommendations for potential future applications in view of contributing to the digital transformation effort of the petroleum industry.

Common machine learning techniques applied in the reviewed literature
There are three basic machine learning paradigms, namely supervised, unsupervised, and reinforcement learning (Bishop 2006). In the case of supervised learning, a training set, X, is used in the model function, f(X), to build and optimize its relationship with a known target, T. T could be composed of labels (in case of classification) or continuous values (in the case of regression). Using a metric (loss) to estimate the similarity between the model prediction, T � = f (X) , and the actual target, T, the model is optimally tuned to predict the desired output. This can be expressed mathematically by The goal of the tuning process is to keep the loss within a certain threshold or as low as practically possible. Common examples of losses used in training include mean squared error for regression problem and cross entropy for classification. Updating the model during the tuning process is often performed by gradient descent and back-propagation methods. Examples of techniques utilizing this type of learning methods include support vector machine (SVM), decision trees, and artificial neural network (ANN). For the unsupervised techniques, a corresponding target value of T is not available for a set of inputs, X. Rather, the learning goal typically involves discovering optimal groups with similar features in the data (e.g., clustering methods), reducing the dimension of the data (e.g., principle component analysis) or density estimation. Reinforcement learning is learning actions through trial and error based on rewards and punishments system in which the learning algorithm seeks to maximize the reward. An analogy of this type of learning is teaching a dog to sit. We emulate a situation, and the dog will respond in various ways. The dog will be rewarded with food if he sits. Next time when we emulate the same situation, the dog will sit expecting the reward.
A more recent ML technique known as meta-learning or "learning to learn" can be classified as a semi-supervised learning (Maclaurin et al. 2015;Andrychowicz et al. 2016). In meta-learning, the goal is to find the optimal hyperparameters such as loss, learning rate, regularization parameter, initial weights, and activation functions. Depending on the problem, one may choose whichever hyperparameters to optimize as desired and these hyperparameters will be referred to as "meta-variable." A loss, referred to as metaloss, measures the performance of the network or model in terms of the degree of similarities of the model prediction and the actual target values. The algorithm starts by running series of training instances using some ML models. The meta-loss measures how well the ML model has succeeded in predicting the target. It would then propagate the error to update the network parameters. This type of learning is often implemented in nested loops. An inner loop to perform several training steps and an outer loop to optimize the meta-variables. Therefore, meta-learning requires high computational cost compared to a normal training process. Another requirement is that the network parameter should have higher-order derivatives such as the "gradient of the gradient." The following subsections discuss in more detail some of the techniques used by geoscientists in velocity modeling, specifically clustering methods and ANN.

K-means and DBSCAN clustering
Clustering, a type of unsupervised learning, is a technique that identifies groups (or clusters) with similar features in the data (Gan et al. 2007). The goal for clustering algorithm is to partition the dataset into clusters (i.e., groups). Researchers have suggested different methods to choose the optimal number of clusters (e.g., Maclaurin et al. 2015;Salvador and Chan 2004). The user often determines that by trial and error. Various clustering algorithms such as K-means, expectation maximization, density-based spatial clustering (DBSCAN), and fuzzy clustering have been developed (Bradley et al. 1998;Rokach and Maimon 2005;De Oliveira and Pedrycz 2007). This section focuses on only the algorithms for K-means clustering and DBSCAN since they are applied in velocity applications.
K-means, where K represents the number of clusters, is one of the most commonly used clustering methods (Bishop 2006). It is implemented in the following steps: 1. Initialize cluster centroids. 2. Assign the data points to the closest clusters based on their proximity to the centroids. 3. Update the cluster centroids based on the mean of the data points within the cluster.
4. Repeat steps 2 and 3 until the centroids converge, that is, when the difference between the new and the current centroids is zero or less than some tolerance. Figure 1 shows an example of this iterative process with three clusters from input data (a) to the last iteration.
DBSCAN is a density-based clustering algorithm (Ahmed and Razak 2016). By defining a radius, , and a minimum number, minPts, of points, DBSCAN classifies the points into three categories: 1. Core points: the points that contain more than minPts within the radius, . 2. Border points: the points that contain less than minPts within the radius, , and they are neighbor of core points. 3. Noise (outlier) points: the points that belong to neither core points nor border points. Figure 2 shows an example of this classification using three as minimum number of points (minPts).
The workflow for the method is shown in Fig. 3 and composed of the following steps: 1. Select random points that are not assigned to a cluster or noise. 2. Compute the neighborhood points within the distance, . 3. Assign the points using the following conditions: a. If the number of points within is larger than minPts, it becomes a core point. b. If the number of points within is less than minPts and it is in the neighborhood of a core point, it is a border point. c. Identify the point as noise if none of the above conditions is satisfied.
4. Assign the class of the core point to its border point. 5. The process is repeated until all data are assigned to a cluster or identified as outlier.

Artificial neural network
ANN is a powerful learning tool to approximate nonlinear functions. It is considered a supervised learning technique as it requires providing the true outcome for each point in the training data (Bishop 2006). The simplest form of the ANN is the fully connected layers (FC). FC is composed of three main layers: an input layer, which receives the input features; one or more hidden layers, which perform all the computations; and an output layer, which produces the final results. Each of the hidden layers consists of neurons, which are connected to the previous and next layers by weights. Inside a single neuron, the input vector is multiplied by the weights and a summation task is performed. To produce a final output of the neuron, an activation function is applied to the summation. Figure 4 shows a typical FC structure with three hidden layer for illustration.
Recently, more advanced ANN algorithms have been developed. These include convolutional neural network (CNN) (LeCun et al. 1995) and recurrent neural network (RNN) (Hochreiter and Schmidhuber 1997). CNN uses local convolutional filters to extract the spatial features from the inputs. It is widely used in image processing, object detection, image segmentation, and classification problems. RNN uses a memory variable that stores information from previous inputs in the new prediction. It is widely used for time series problems. Some of the common applications for RNN are natural language processing, language translation, and time series forecasting. RNN suffers from vanishing or exploding gradient problem. The structure of a version of the RNN algorithm, known as long short-term memory (LSTM), addressed this issue and has been commonly used (Hochreiter and Schmidhuber 1997).

Machine learning techniques applied to velocity estimation
This section discusses two major types of velocity models: NMO and velocity models obtained by FWI. The NMO velocity flattens the hyperbolas in the CMP gathers. NMO velocity is not accurate in most cases, as it is based on lateral homogeneity of the subsurface. Despite this, it is used for an   Each circle represents a single neuron initial estimate for the subsurface model. FWI used iterative method to update the velocity model and obtain more accurate and higher resolution velocity. However, it should start from a good initial model. FWI is usually implemented by inverting for the low frequencies of the data and gradually progressing to the higher ones to ensure convergence. However, these low frequencies are often missing which make it challenging to FWI to converge to global minimum.

NMO velocity analysis
Traditionally, NMO velocity is obtained by creating semblance spectrum panels for several CMP gathers. The semblance panel consists of a range of velocities in one axis and the two-way travel time in the other. The highest energy semblance, which indicates the velocity at a particular time, is typically picked manually. With the help of machine learning, the picking process has been partially automated. Different strategies have been implemented with unsupervised learning such as clustering and supervised learning with ANN. Using ANN for automatic velocity picking is not new. Schmidt and Hadsell (1992) and Fish and Kusuma (1994) are few of the pioneers that used ANN for velocity analysis.
The networks at that time were shallow and only able to extract some local information of the velocity semblance.
Several researchers suggested the use of K-means clustering as an auto-picking method for velocity analysis. Since seismic data are often contaminated with noise, Smith (2017) suggested using different attributes such as semblance, AVO auto-picking, and continuity of the gathers across offsets for clustering. This ensures more robustness. To further ensure that only the points around the high semblance are considered, Wei et al. (2018) proposed using some constraints such as adding a few manually picked semblances as a guide. Chen (2018) applied a threshold to keep the high-energy points. He implemented the K-means algorithm on the Gulf of Mexico (GOM) data. He achieved good picking except at some deeper parts where the spectrum is diffused. Bin Waheed et al. (2019) later compared the performance of K-means with the DBSCAN clustering algorithm. Their findings suggested that the lower value of K in the K-means algorithm would result in no picks, while the larger value is likely to lead to error (Fig. 5a). For DBSCAN, the values of the picks for the tested radii are similar as shown in Fig. 5b. They made a comparison between the picks for K-means (K = 5), DBSCAN ( = 0.02) and the true velocity (Fig. 5c). Using an alternative approach, Ma et al. (2018) formulated the problem as a regression type rather than using the semblance picks. To achieve the regression objective, they used CNN to estimate the NMO velocity from the pre-stack CMP gathers directly. A predefined range of velocities were applied to the CMP gathers to flatten them. They trained the CNN model by taking mini-batches from the CMP gathers and outputting a number indicating the velocity errors. For example, the output is 1 if the CMP is flat, 0.9 for overcorrection, and 1.1 for under-correction. The velocities corresponding to an output equal to 1 were selected for the velocity model. The method produced promising result when it was applied to the Marmousi model using velocities in the range of 0.9 and 1.1 of the true velocity. Biswas et al. (2019b) suggested using RNN to obtain the velocity semblance picks from pre-stacked CMP gathers as a regression problem. The velocity governed the spread of the hyperbolas in time and offset. This means that the information needed to estimate the velocity at a particular time step is the neighboring temporal and spatial information. Therefore, they considered windows of offset size NX and time 2 N as illustrated in Fig. 6. On the left is the CMP gather with offset NX. The blocks of data from the CMP gather was used for creating a single instance of a mini-batch (multiple sequences). The right panel is the corresponding NMO velocity pick. The velocity was estimated at the centre of the window shown in magenta color. RNN took the input X sequentially from X i to X p where i represents the time-step index and p is the number of time-steps. A fully connected (FC) layer was applied after the RNN for projecting the output to the desired dimension. It is worth mentioning that during the training, mini-batches were used to update the weights in a single iteration. The method was implemented on pre-stacked 2D data provided by Geofizyka Torun Sp Z o.o. in Poland and available in the public domain (https:// wiki. seg. org/ wiki/ 2D_ Vibro seis_ Line_ 001). The training data were on 10% of the CMPs, which was about 80 gathers. Figure 7 shows a comparison of a hand-picked NMO with the predicted NMO by the RNN network.
In a similar fashion to Biswas et al. (2019b), Zhang et al. (2019aZhang et al. ( , 2019b tested LSTM to automate the picking. However, they combined the LSTM with a CNN model known as YOLO (You Only Look Once) and considered the problem as an object detection problem.
More recently, Park and Sacchi (2020) utilized CNN to automatically pick the semblance. They formulated the task to be a classification problem. The network's input was defined as a pair of a guide image, G, which represents the velocity semblance. The target images, T, contained the semblance image at a specific range, τ, where τ is the zerooffset two-way travel time. This is illustrated in Fig. 8. Each image, T, represents the velocity in the middle of the range, τ. To reduce the computational cost, the semblance images were down sampled to 50 × 50 pixels. The output class was defined by dividing the velocity axis in the semblance panel into velocity ranges such that each range was considered as a class. They tested the method with common network structures such as LeNet-5, AlexNet, and VGG16. They concluded that VGG16 was the best choice for their implementation. The network was trained using seven synthetic quazi-horizontal models. However, with the help of transfer learning, the application was extended to more complex models.
Transfer learning is defined as follows: Let A and B be two similar tasks. By using few samples from B to tune a pre-trained model on A, a good network that predicts from B can be achieved.
Park and Sacchi (2020) used six semblance panels out of 1400 from GOM data to perform transfer learning on a pre-trained network. The predicted model is similar to the NMO velocity with manual picking except at a part in the middle as shown in Fig. 9. The reason for that was explained as being due to the lack of samples on which to perform the transfer learning. This is a proof that transfer learning is vital when the model is not quazi-horizontal.
We summarize the reviewed studies on using machine learning in modeling NMO velocity in Table 1.

Velocity inversion
This section discusses some of the applications to invert for seismic velocity using ANN. These studies differ from those discussed in the previous section in that they provide a more accurate velocity that mimic the ones obtained from FWI. Some of the attempts implemented a direct inversion, which implies inputting the data and outputting the velocity  Haber and Tenorio (2003), and Zhang and Alkhalifah (2019). The later type is referred to in this paper as ML-assisted velocity inversion. Each of these is discussed in more details. -Polo et al. (2018) proposed an FC layer of ANN to predict the velocity directly from the shot gathers. They suggested extracting features from the shot gathers as doing so helped the training to converge faster and more accurate. To achieve that, they converted the data to a semblance cube and used it as features since the semblance contains patterns related to the velocity. The label for the network was the ground truth velocity. They further used three FC layers with dropout and batch normalization to test the approach. They conducted two experiments to test this method. In the first experiment, the output was a continuous-valued image and the label was composed of discrete values containing velocities. This case needed a post-processing procedure such as K-means segmentation to be applied to the output velocity images. In the second experiment, the actual labels and the predicted velocities were of continuous values. In addition, salt-bodies were included in some of the models. The two experiments performed similarly for layered velocity models. It did not perform well in the cases containing salt bodies. Salt bodies are typically challenging to invert for even in conventional approaches.

Araya
Without using any features, Wu et al. (2018) and Yang and Ma (2019) inverted for the velocity directly from the raw data. They recommended the use of CNN. While the Fig. 9 a is the hand-picked NMO velocity model for GOM data. b is the predicted velocity using CNN. c is the difference between a and b. Transfer learning is implemented using 6 CMP gathers indicated by the dotted lines (Park and Sacchi 2020)  (2020) Supervised CNN Automated the picking from the semblance former focused on inverting for models containing faults, the latter inverted for models containing salt bodies. Yang and Ma (2019) used a modified U-net architecture. In a typical U-net (Ronneberger et al. 2015), the input and the output are in the same dimension. In this work, the input was in space-time (x, t) while the output was in space-depth (x, z) dimension. The samples used for training are 2D synthetic velocity containing salt bodies. They used different shots generated from the same model as channels for the input. Therefore, the number of channels for each input was the same as the number of shots. The labels were the velocity models used to generate the data. The results showed promising capability of obtaining the velocity model directly from the raw data. The trained network was then used as an initial network for a different training set from SEG/EAGE salt models (Aminzadeh et al. 1996). The prediction of the network was not as good. According to Yang and Ma (2019), this may be due to the lack of training models. In FWI, the governing equation for the problem, which is mainly the wave equation, is well known. It would be desirable to take advantage of that and use a physics-guided machine learning approach. Biswas et al. (2019a) inverted for the velocity based on the physics using an encoderdecoder CNN network. In this technique, the input dimension, which is the seismic data, was reduced in the encoder part and then restored back in the decoder part. The output model was then used in the wave equation to generate a synthetic model, and the difference between the input and the generated data was used to compute the gradient. It should be noted that this is an unsupervised approach as there was no label. Rather, the physics was used to compute the gradient and update the network.
Several researchers in the area of FWI such as Van Leeuwen and Herrmann (2013), Alkhalifah and Song (2019), and Sun and Alkhalifah (2019b) proposed using robust misfit functions to overcome the limitations of the conventional least-squared objective functions. In line with this, Sun and Alkhalifah (2020) proposed to learn a more robust objective function using the concept of meta-learning. They formulated the problem to find the optimal objective (minimum loss) by replacing the conventional L2 norm with an ANN model. They formulated the meta-loss such that the network mimicked the behavior of the optimal transport of the matching filter misfit function (Sun and Alkhalifah 2019b). As a simple test, they applied the method on a simplified FWI by inverting only for a travel time shift between two traces. They plotted the learned misfit function at the first epoch and after 250 epochs and then compared it with the L2 misfit. The ML-misfit after 250 epochs showed better convexity than the L2 objective. They then learned the objective function using random 2D horizontal layers and inferred on the Marmousi model. Since the low frequencies were missing from the data, they introduced the well-known cycle-skipping problem. Because of this, the conventional FWI result was cycle-skipped while the ML-misfit inversion was not. This suggested that the learned misfit was more robust than the L2 objective.

Velocity inversion assisted by machine learning
In the theory of FWI, the multi-scale approach (Bunks et al. 1995) suggested starting the inversion from low frequencies and progressively including the high frequencies until the whole bandwidth of the data has been used. Despite this guarantee a stable convergence, the data often lack the low frequencies. Many researchers have attempted to extrapolate the missing low frequencies from the high frequencies (Hu et al. 2019;Wu et al. 2014). The effort had been limited to the single scattering assumption, known as the Born approximation, and the acquisition. Ovcharenko et al. (2018) used a feed-forward ANN model to extrapolate the low frequencies. Thereafter, and for computational efficiency for large inputs, they suggested using CNN for the extrapolation (Ovcharenko et al. 2019). They formulated the CNN such that the input is the high-frequency contents of a shot gather. The output was a single low frequency.
The network used in this approach consisted of four convolutional blocks, followed by pooling layers, and then two FC layers. The data used for training were generated with random models. The network could only extrapolate to a single frequency, implying that there were individual CNNs for each target frequency. They tested the network on the central BP 2004 model, which contained a large salt body. The available frequency bandwidth in the seismic data ranged from 2 to 4.5 Hz. They extrapolated the frequencies to 0.25, 0.5 and 1 Hz. The error of extrapolation increased at higher frequencies. This was possibly caused by the introduction of more complex contributions of subsurface features into the total misfit (Ovcharenko et al. 2019). For FWI applications, the extrapolated low frequencies data are typically first used in inversion. Then, the final inversion result was used as input for the subsequent higher frequency until all the bandwidth was covered. They used this approach to invert for the BP velocity model. The final result successfully reconstructed the salt body, which would have been difficult using the available frequency that started from 2 Hz.
There is a lot of information that can be used to impose constraints and regularize FWI. Such constraints include geological information. The equation currently used to connect different data information is based on some surface assumptions. To better connect the data, statistical principles have been effectively used to merge different information by using deep learning. Zhang and Alkhalifah (2019) learned the probabilities of the facies for each P-velocity (V p ) and S-velocity (V s ) from a nearby well. They then mapped the probabilities into the whole estimated model by using a 1 3 weighted sum, ∑ p i v i , where p is the probability of the individual facies, i. They used a network with four hidden layers, 64 neurons in each layer, and three input features, namely V p , V s and V s /V p , to output the facies. The algorithm is summarized below, and the flowchart is presented in Fig. 10: 1. Perform elastic FWI.
2. Extract facies information from a well log or any other sources. 3. Choose vertical profiles near the well from the estimated model in step 1. Then build the connection between these estimates and the interpreted facies by training a feed-forward ANN. 4. Use the trained network to predict the facies for the whole model and use a weighted summation to generate the P-velocity (V p ) and S-velocity (V s ). 5. Use the converted V p and V s velocities as input or regularization for another cycle of FWI. 6. Repeat the process if apparent error estimation exists.
They tested the above methodology on the BigSky field data. They manually interpreted 11 facies from a well log to obtain the initial model for the FWI by smoothening the velocity. They found that the inversion results along with facies were more accurate and higher in resolution than the conventional FWI. Figure 11 shows their inversion result. The yellow arrow points to high-resolution anomaly that was not capture by the conventional inversion.
The results of the literature survey discussed above are summarized in Table 2.

Remaining gaps in the current level of machine learning applications in seismic velocity modeling
Most of the work reviewed in this paper used insufficient data to train the models. It is well known that the performance of machine learning methods usually improves with the addition of more data in terms of quantity, quality, and relevance of the features. The data used often belonged to the same velocity model, which would adversely affect the generalization of the models. Transfer learning could be a solution to use the network in different velocity models. It does not guarantee good estimation if it did not capture the new features in the new model. In most of the ML velocity applications discussed above, the training is usually performed on random synthetic models, which may not be realistic. A typical case is holding some assumption such as invariant lateral velocity like in Park and Sacchi (2020).
Testing the application on real data set can be very different as the real data are contaminated with noise. The ML applications discussed above only considered simple cases where the medium is isotropic and acoustic. For example, NMO applications will surely fail to flatten the hyperbola in an anisotropic medium, as they do not account for the η parameter (Alkhalifah and Tsvankin 1995). It would be noted that in the inversion applications, a network is often trained to obtain only the acoustic model from the data. The goal of NMO is to obtain an initial estimate of the velocity model, which is an approximation process. Hence, all the methods discussed under this gave promising results with different shades of limitations. There are apparently ample rooms for improvement. In the case of the clustering method, the small variations in the data due to noise or in the diffused deep region of the semblance might lead to inaccurate predictions as observed Chen (2018) andBin Waheed et al. (2019). In using CNN for classifying the velocity based on the semblance (Park and Sacchi 2020), the semblance panel does not account for the surrounding semblance and the lateral continuity as expected of a good processor. The network only trained on the lateral homogeneous models. This imposes some limitations on it.
One of the major limitations in the application of machine learning in velocity inversion (FWI or traveltime tomography) is that the network is only valid for the specific dataset that was used to train it. Different models have different structures yielding different signatures in the data. For example, Yang and Ma (2019) trained a CNN model to capture the salts. When tested on models that contains salt bodies with layers of sediments, it failed to recover the layers. Besides, when the input of the network was shot gathers from raw data, the network was restricted to take a fixed number of channels (shots from the same velocity model). Including more shots from different locations would add more information and improve the generalization capability of the velocity model. However, this would be limited by the memory of the GPU. An alternative approach would be to extract features from the data. For example, Araya-Polo et al. (2018) only used one feature. From the theory of ML, extracting more features will definitely help the model to be more accurate. In addition, if the ML method used for inversion is guided by the physics rather than being completely driven by data such as in (Biswas et al. 2019a), then it would be prone to the cycle skipping problem like in the conventional FWI method.

Recommendations for future improvements
In this paper, we have presented a comprehensive review of some significant machine learning applications in velocity model building. We tracked and critically reviewed the evolution of various efforts directed toward automating the velocity picking for NMO velocity. The review exercise revealed that most efforts directed toward applying ML techniques in velocity modeling are limited to the traditional single-instance supervised learning techniques namely the families of ANN (feed-forward back-propagation, RNN, and CNN). Traditional clustering techniques, namely k-mean and DBSCAN, are also used to apply unsupervised learning. The review also revealed that some of the efforts used ML techniques either as the main engine for the inversion or to assist the traditional methods. We identified some gaps that remain to be filled as well as limitations that need to be improved on.
Based on the identified gaps and limitations, and in view of the need to accelerate the digital transformation agenda of the petroleum industry, we present the following recommendations to improve future applications of machine learning in seismic velocity modeling and estimation: 1. Since most of the reviewed applications of machine learning techniques used very limited amount of data for training, it might be useful as a guide to refer to the recommendations of Anifowose and Abdulraheem (2010) and Anifowose et al. (2017aAnifowose et al. ( , 2017b on the minimum amount of data considered sufficient for shallow networks and models. 2. Going beyond using only the semblance for NMO velocity with clustering methods, the feasibility of using other features such as coherency and some measures of continuity for input could be investigated. 3. Using big data from different velocity models is a necessary requirement for achieving better generalization of the ANN and CNN models.
4. Using a 3D semblance cube instead of 2D and applying a semantic segmentation rather than classification might better account for continuity and obtain more accurate velocity. 5. For velocity inversion methods that use the raw shot gathers as input, dimensionality reduction can be used as a pre-processing step to utilize only the most significant shots for better accuracy and to minimize memory utilization. Principal component analysis, discrete cosine transform, and auto-encoders are examples of such dimensionality reduction tools. Doing this will not only minimize memory utilization but will also reduce the computational cost and increase the efficiency. 6. Since problem formulation is more of art than science, solutions to different problems are formulated in different ways. These include using different input features and target variables, different learning types (regression or classification), and different learning methods (supervised or unsupervised). This suggests that more advanced, robust, and state-of-the-art ML techniques such as SVM, random forest, light gradient boosted machine (LightGBM), extreme gradient boosted machine (XGBoost), and extreme learning machines (ELM) can be utilized with evidently more effective formulations. 7. Most of the work done so far has been with ANN. Other techniques such as those mentioned above are more robust and powerful in classification problem and can be used in place of ANN. Despite this, they have been largely underutilized in seismic velocity modeling. SVM, in particular, may not require large dataset unlike in the case of ANN (Shao and Lunetta 2012). Random forest and ELM have been presented to be robust and have the capability to avoid overfitting (Zhu et al. 2005;Bernard et al. 2012;Liu et al. 2013). LightGBM (Zhang et al. 2019a(Zhang et al. , 2019b and XGBoost  have proved to be robust and powerful in other fields. 8. The applications should be very explicit on how the models used for training are generated. Realistic models should contain some earth structures such as faults, anticlines, and salt bodies. Oftentimes, machine learning methodologies are validated on specific models containing one structure such as salt body. Generating more realistic models that combine more earth structures in a single model is a huge-demand and much-desired area of research that will not only benefit training the machine learning models but also prove to be useful to validate any general theory on the generated models. 9. The ML techniques applied so far in seismic velocity modeling are single-instance models. Since these techniques have their respective areas of strengths and weaknesses, we recommend the application of hybrid and ensemble learning algorithms (Anifowose et al. 2017a(Anifowose et al. , 2017b. These new algorithms have the capability to combine the respective benefits of existing techniques by complementing the weaknesses of some by the strengths of the others.

Conclusions
We reviewed some machine learning applications in velocity modeling. These applications include automation of semblance picking for NMO velocity, obtaining the velocity directly from seismic data, and assist in inverting the velocity. Various machine learning technique are used in these applications, namely unsupervised, supervised, and semi-supervised methods. We also provided some recommendation that can be used for the future. The field of artificial intelligence is evolving very fast, which raise the potential of improving the applications further. We hope that this paper will be of benefit, especially to young professionals to understand the evolution of machine learning applications in seismic velocity modeling from the inception to the current development. It will also benefit researchers in this field to direct their efforts toward providing complete automation of and more robust solutions to various seismic velocity estimation challenges.
Funding None.

Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.