Seismic Full Waveform Inversion Accelerated by Overlapping Data Input and Computation

Seismic full waveform inversion (FWI) is a powerful technology to obtain high-precision and high-resolution images of subsurface structures. However, FWI is a data-intensive algorithm that needs to read extensive seismic data from disks, which significantly affects its performance. We propose a portable parallel framework that improves FWI by overlapping data input and computation (ODIC). The framework is based on POSIX threads (Pthreads), a standard thread API library, and creates a parent thread and a child thread in the FWI process. The former performs computation and the latter reads data from disks, with both running simultaneously. This framework has two attractive features. First, it is broadly applicable; it can run on almost any computer from a laptop to a supercomputer. Second, it is easy to implement; it can be readily applied to existing FWI programs. A 3D FWI example shows that the framework speeds up FWI considerably.


Introduction
Seismic full waveform inversion (FWI) extracts sufficiently accurate information from seismic recordings to reconstruct subsurface velocity models, which often serve as highly accurate, high-resolution images of subsurface structures (Rao & Wang, 2013; Rao et al., 2016; Virieux & Operto, 2009; Wang & Rao, 2009). However, FWI often requires reading massive amounts of data from disks, which hinders its wide application, especially in 3D cases with TBs or PBs of data. Therefore, it is necessary to develop parallel frameworks that work well for FWI.
For FWI, efficiently accessing an enormous amount of data has always been a major challenge.
Distributed storage and computing is one of the most effective technologies for this purpose (Arrowsmith et al., 2022). Distributed file systems, such as Google File System (GFS) (Ghemawat et al., 2003) and Hadoop Distributed File System (HDFS) (Shvachko et al., 2010), and parallel computing frameworks, such as MapReduce (Dean & Ghemawat, 2008), Hadoop (Shvachko et al., 2010) and Spark (Zaharia et al., 2010), enable high-throughput processing of TB or PB data. Using Hadoop and HDFS, Addair et al. (2014) implemented a global-scale cross-correlation analysis of a 1 TB seismic waveform dataset, achieved an average data processing rate of 16.7 GB/min and accelerated the processing by a factor of 19. Magana-Zook et al. (2016) extended this experiment to analyze a dataset of over 40 TB using Spark and accelerated the analysis by a factor of 15.
However, distributed storage and computing depend on high-performance computers such as clusters and supercomputers, which limits their use on traditional computers, e.g. laptops and desktops. Moreover, migrating large amounts of data from conventional storage to distributed storage and converting existing code from C/C++ or other languages to Hadoop or Spark are quite difficult.
This paper is primarily about developing a novel and portable parallel framework that works well for FWI. The CUDA stream technique overlaps computation on the GPU with data transfer between the CPU and the GPU (Cheng et al., 2014). Inspired by this technology, we propose a framework that can parallelize computations and data accesses in FWI. We then apply the proposed framework to a shot-encoded FWI (Krebs et al., 2009). We will show that the framework is widely applicable and simple and can significantly speed up FWI.

Recap of POSIX Threads (Pthreads)
In modern Unix/Linux operating systems, a process is the instance of a program executed by one or more threads (Silberschatz et al., 2004), and a thread is the smallest sequence of programmed instructions that can be managed independently by a scheduler (Lamport, 1979). A thread can thus be considered a component of a process, and several different threads in a given process share resources, such as memory, and can be executed concurrently via multithreading technologies.
Pthreads is a parallel execution model that allows a program to control several different work flows which can overlap in time. Each flow is called a thread, and the creation and control of these flows is done by calling the Pthreads APIs specified by the POSIX standard (Butenhof, 1956).
As one of the basic API libraries in the Linux operating system, Pthreads can run on almost all computers, including laptops, desktops, workstations, servers, clusters and supercomputers. Moreover, it is a portable library developed in the C language, so it can be easily applied to existing FWI programs.
For a serial program (Fig. 1), a parent thread and a child thread are created by Pthreads. The first performs computations (subfunction1) and the second reads data from the disks, both executing simultaneously. This parallelizes the serial program and thus improves its performance.
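The parent/child arrangement described above can be illustrated with a minimal C sketch. It is not the authors' implementation: the names `read_job`, `read_data`, `compute` and `odic_run` are hypothetical, the "read" simply fills a buffer in memory where a real program would call `fread()` or `pread()`, and the "computation" is a placeholder sum. Only `pthread_create` and `pthread_join` are actual Pthreads calls.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job descriptor handed to the child (reader) thread. */
typedef struct {
    float *buf;   /* destination buffer for the data */
    size_t n;     /* number of samples to load */
    int status;   /* set to 0 by the child on success */
} read_job;

/* Child thread: stands in for reading data from disk.
 * A real program would call fread()/pread() here. */
static void *read_data(void *arg)
{
    read_job *job = (read_job *)arg;
    for (size_t i = 0; i < job->n; ++i)
        job->buf[i] = (float)i;
    job->status = 0;
    return NULL;
}

/* Parent-side work: stands in for the computation (subfunction1). */
static double compute(size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += (double)i;
    return acc;
}

/* Run the read and the computation concurrently.
 * Returns 0 on success, -1 if a thread call fails. */
int odic_run(float *buf, size_t n, double *result)
{
    read_job job = { buf, n, -1 };
    pthread_t child;

    if (pthread_create(&child, NULL, read_data, &job) != 0)
        return -1;
    *result = compute(n);                 /* overlaps with the read */
    if (pthread_join(child, NULL) != 0)   /* wait until the data is ready */
        return -1;
    return job.status;
}
```

A caller would invoke `odic_run(buf, n, &result)` and use `buf` only after the function returns, since `pthread_join` is the point at which the data is guaranteed to be complete. On most systems the program is compiled with the `-pthread` flag.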

Full Waveform Inversion Improved by Pthreads
In FWI, the reconstruction of subsurface velocity models is implemented iteratively, so that the synthetic seismic response based on the estimated models increasingly matches the observed field data. Therefore, the objective function is generally defined in terms of data misfit as follows:

$$J(\mathbf{m}) = \frac{1}{2}\left\| \mathbf{d}_{\mathrm{obs}} - \mathbf{d}_{\mathrm{cal}}(\mathbf{m}) \right\|^{2},$$

where $\mathbf{m}$ is the model to be inverted, and $\mathbf{d}_{\mathrm{obs}}$ and $\mathbf{d}_{\mathrm{cal}}$ are vectors of the observed data and the synthetic data, respectively. The objective function is minimized by iteratively updating the model. The model updating is described as follows (Wang, 2017):

$$\mathbf{m}_{k+1} = \mathbf{m}_{k} - \alpha_{k}\,\mathbf{B}_{k}\,\mathbf{g}_{k},$$

where $k$ is the iteration number, $\alpha_{k}$ is the optimal step length determined by a line search method, $\mathbf{g}_{k} = \nabla J(\mathbf{m}_{k})$ is the gradient vector determined by cross-correlating theoretical and back-propagation wavefields, and $\mathbf{B}_{k}$ is an inverse Hessian matrix. The model updating follows the negative direction of the gradient vector (Wang, 2017). In an L-BFGS method, $\mathbf{B}_{k}\mathbf{g}_{k}$ can be calculated by a recursive algorithm (Nocedal & Wright, 2006; Rao & Wang, 2017).
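As an illustration of the recursive algorithm mentioned above, the following C sketch applies the standard L-BFGS two-loop recursion (Nocedal & Wright, 2006) to a gradient vector. It is a generic textbook form, not necessarily the variant used in this work. The function name `lbfgs_apply`, the cap `LBFGS_MAX_PAIRS` and the initial scaling by the most recent curvature pair are assumptions made for the example; `s[i]` and `y[i]` denote the stored model differences and gradient differences of earlier iterations.

```c
#include <stddef.h>

#define LBFGS_MAX_PAIRS 16   /* illustrative cap on the stored pairs */

static double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Two-loop recursion. On entry r holds the gradient; on success r holds
 * the product of the inverse Hessian approximation and the gradient.
 * s[i] = model difference and y[i] = gradient difference of a stored
 * iteration, ordered from oldest (0) to newest (npairs - 1).
 * Returns 0 on success, -1 if npairs exceeds the cap or a curvature
 * product y_i . s_i is not positive (r may then be partly modified). */
int lbfgs_apply(double *r, size_t n,
                const double *const *s, const double *const *y,
                size_t npairs)
{
    double alpha[LBFGS_MAX_PAIRS], rho[LBFGS_MAX_PAIRS];

    if (npairs > LBFGS_MAX_PAIRS)
        return -1;

    /* First loop: newest pair to oldest. */
    for (size_t i = npairs; i-- > 0; ) {
        double ys = dot(y[i], s[i], n);
        if (ys <= 0.0)
            return -1;
        rho[i] = 1.0 / ys;
        alpha[i] = rho[i] * dot(s[i], r, n);
        for (size_t j = 0; j < n; ++j)
            r[j] -= alpha[i] * y[i][j];
    }

    /* Initial inverse Hessian: identity scaled by the newest pair. */
    if (npairs > 0) {
        const double *sl = s[npairs - 1], *yl = y[npairs - 1];
        double gamma = dot(sl, yl, n) / dot(yl, yl, n);
        for (size_t j = 0; j < n; ++j)
            r[j] *= gamma;
    }

    /* Second loop: oldest pair to newest. */
    for (size_t i = 0; i < npairs; ++i) {
        double beta = rho[i] * dot(y[i], r, n);
        for (size_t j = 0; j < n; ++j)
            r[j] += (alpha[i] - beta) * s[i][j];
    }
    return 0;
}
```

With no stored pairs the routine leaves the gradient unchanged, so the first iteration reduces to steepest descent. The model would then be updated by subtracting the step length times the returned vector.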
In a shot-encoded FWI (Fig. 2), which is performed serially, one of the inversion iterations can be divided into five steps: (1) reading and encoding the data from hard disks (RED); (2) calculation of the theoretical wavefield of the initial model estimate (CTW); and the three subsequent steps (3)-(5). Of these steps, step (1) and step (2) are independent and can therefore be carried out concurrently, while the others may only be implemented after the first two steps have been completed.
In a single inversion iteration (Fig. 3a), a child thread and a parent thread are created in an FWI process. The former is used to implement step (1) and the latter to implement step (2), both of which are executed simultaneously. This allows us to parallelize the computation and data input to improve FWI.
In two adjacent inversion iterations (Fig. 3b), step (1) in the next iteration is independent of steps (3)-(5) in the current iteration, and step (1) and step (2) are also independent in the next iteration. Thus, when step (1) is completed in the current iteration, a new child thread for step (1) in the next iteration can be started immediately and executed simultaneously with steps (3)-(5) in the current iteration and step (2) in the next iteration. This further parallelizes data input and calculation. If step (1) takes the same time as steps (2)-(5) combined, FWI reaches the maximum speed-up of a factor of two.
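The cross-iteration overlap can be sketched with two alternating buffers: while the parent thread works on the data of the current iteration, a child thread loads the data of the next one into the other buffer. This is a minimal illustration under stated assumptions, not the authors' code. The names `load_job`, `load_encoded`, `step2_ctw`, `steps3_to_5` and `run_pipeline` are hypothetical, and the loading and computing routines are placeholders for step (1) and steps (2)-(5).

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical descriptor for one load request (step 1, RED). */
typedef struct {
    int iter;     /* iteration whose data is being loaded */
    float *buf;   /* destination buffer */
    size_t n;     /* number of samples */
} load_job;

/* Child thread: placeholder for reading and encoding data from disk. */
static void *load_encoded(void *arg)
{
    load_job *job = arg;
    for (size_t i = 0; i < job->n; ++i)
        job->buf[i] = (float)job->iter;
    return NULL;
}

/* Placeholder for step (2), which does not need the loaded data. */
static void step2_ctw(int iter) { (void)iter; }

/* Placeholder for steps (3)-(5), which consume the loaded data. */
static double steps3_to_5(const float *data, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += data[i];
    return acc;
}

/* Run niter iterations with the load of iteration k+1 overlapping
 * steps (3)-(5) of iteration k and step (2) of iteration k+1.
 * Returns 0 on success, -1 if a thread call fails. */
int run_pipeline(int niter, float *buf0, float *buf1, size_t n,
                 double *total)
{
    float *bufs[2] = { buf0, buf1 };
    load_job job;
    pthread_t child;

    *total = 0.0;
    if (niter <= 0)
        return 0;

    job = (load_job){ 0, bufs[0], n };
    if (pthread_create(&child, NULL, load_encoded, &job) != 0)
        return -1;

    for (int k = 0; k < niter; ++k) {
        step2_ctw(k);                        /* overlaps with load k */
        if (pthread_join(child, NULL) != 0)  /* data of iteration k ready */
            return -1;
        if (k + 1 < niter) {                 /* prefetch iteration k+1 */
            job = (load_job){ k + 1, bufs[(k + 1) % 2], n };
            if (pthread_create(&child, NULL, load_encoded, &job) != 0)
                return -1;
        }
        *total += steps3_to_5(bufs[k % 2], n);
    }
    return 0;
}
```

The job descriptor is reused only after `pthread_join` has returned, so the child never reads a descriptor that is being overwritten, and the parent never reads a buffer that is still being filled.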

Effectiveness of the Parallel Framework
To test the effectiveness of the proposed framework, we apply it to a 3D FWI. A SEG/EAGE Overthrust model (Fig. 4a) is used as the actual velocity model. The size of the model is 8 × 8 × 1.86 km³. This velocity model is discretized into 401 × 401 × 94 grid points with a cell size of 20 × 20 × 20 m³.
We set a Ricker wavelet with a peak frequency of 15 Hz as the source signature and generate synthetic shot-gathers from 961 shots located at the surface with a shot interval of 240 × 240 m². Each shot-gather is composed of the traces from 34,596 receivers with a trace interval of 40 × 40 m². The total volume of the entire data set is about 751 GB. All individual shots are combined, according to a shot-encoding method (Krebs et al., 2009), to form a super shot-gather, which is used as input for FWI.
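To illustrate how individual shots can be combined into a super shot-gather, the sketch below sums shot-gathers weighted by encoding codes. The text does not state which codes are used; random polarity codes (+1 or -1), one common choice in encoded simultaneous-source FWI, are assumed here. The function names, the flat array layout and the small pseudo-random generator are hypothetical and serve only the example.

```c
#include <stddef.h>

/* Small linear congruential generator, used so that the sketch is
 * deterministic for a given seed (an assumption of this example). */
static unsigned lcg_next(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Draw one polarity code (+1 or -1) per shot. */
void draw_codes(int *codes, size_t nshots, unsigned seed)
{
    unsigned state = seed;
    for (size_t s = 0; s < nshots; ++s)
        codes[s] = ((lcg_next(&state) >> 16) & 1u) ? 1 : -1;
}

/* Form the super shot-gather as the code-weighted sum of all shots.
 * shots holds nshots gathers of nsamples values each, stored one
 * after another; all shots are assumed to share the same receivers. */
void encode_shots(const float *shots, size_t nshots, size_t nsamples,
                  const int *codes, float *super_shot)
{
    for (size_t j = 0; j < nsamples; ++j)
        super_shot[j] = 0.0f;
    for (size_t s = 0; s < nshots; ++s)
        for (size_t j = 0; j < nsamples; ++j)
            super_shot[j] += (float)codes[s] * shots[s * nsamples + j];
}
```

For a data set that does not fit in memory, the outer loop would read one shot-gather at a time from disk and accumulate it, which is the kind of input-bound work that the child thread takes over in the proposed framework.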
We use a multi-scale inversion strategy (Ravaut et al., 2004) to implement the shot-encoding FWI and split the super shot-gather by bandpass filtering into three frequency bands: 0.2-6, 6-18 and 18-30 Hz. A smoothed overthrust model is used as the initial estimate for the inversion of the first frequency band. Then, the inverted model of the lower frequency band is used as the initial estimate for the inversion of the next higher frequency band; 100 iterations are performed in each inversion segment, and the final inversion results (Figs. 4b, 5c, 6c) are obtained after 3 × 100 iterations.
Figure 7 compares velocity slices of the true model, the initial model and the inverted model (at depths of 0.4 km, 0.6 km and 0.8 km). From these inversion results, we can see that the FWI implementation accelerated by the proposed parallel framework can reconstruct the overthrust model stably and reliably.
The 3D FWI is implemented on a single-node server, and Table 1 lists its main configuration. To evaluate the performance of the proposed framework, we give the computation time for 100 iterations in the last inversion segment (18-30 Hz). Figure 8 shows the computation timelines of the improved shot-encoded FWI and the traditional version. The two timelines (Fig. 8a) have almost the same values for some iterations because the source-encoding codes are fixed every five iterations. The new parallel framework clearly speeds up FWI considerably. Table 2 gives the computation time of 80 iterations in Fig. 8b. It should be noted that before applying the proposed framework, our FWI had already been improved by a heterogeneous parallel scheme (MPI + CUDA), which achieved a speed-up of about 20 times, and by a shot-encoding method, which reduced the number of waveform simulations in FWI from 3n (n, total number of shots) to 3. On top of these improvements, our framework further improves FWI and provides a speed-up of 1.55 times.
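The reported factor can be read against a simple idealized timing model. This is a back-of-envelope sketch, not an analysis of the measurements: the symbols $T_r$ (time per iteration spent reading and encoding data) and $T_c$ (time per iteration spent on computation) are assumptions introduced here, and the model ignores thread overhead, contention for disk bandwidth and differences in cost between iterations.

```latex
\begin{align*}
T_{\mathrm{serial}} &= T_r + T_c, \\
T_{\mathrm{ODIC}} &\approx \max(T_r, T_c), \\
S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{ODIC}}}
  &\approx 1 + \frac{\min(T_r, T_c)}{\max(T_r, T_c)} \le 2 .
\end{align*}
```

Under this model the bound $S = 2$ is reached only when $T_r = T_c$, and a speed-up of 1.55 would correspond to the shorter of the two phases taking roughly 55% of the time of the longer one.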

Conclusions
In this paper, we have proposed a novel and portable parallel framework that works well for FWI. We have shown that the framework significantly speeds up FWI by overlapping data input and computation (ODIC). The advantages of this framework are its wide applicability and low complexity. Unlike distributed storage and computation, the framework can run on conventional computers such as laptops, desktops, workstations and servers. It also has the potential to achieve higher performance when run on clusters or supercomputers. In addition, applying this framework to existing FWI programs is easy.
As a basic API library in the Linux operating system, Pthreads is universally applicable for various scientific computation tasks. Therefore, the developed framework is also suitable for other numerical

Figure 1 Overlapping data input and computation (ODIC) in a serial program via Pthreads

Figure 2 Workflows of a shot-encoded FWI

Figure 5 Velocity profiles at the position of y = 4 km. a The true model. b The initial estimate. c The inverted model

Figure 6 Velocity profiles at the position of x = 4 km. a The true model. b The initial estimate. c The inverted model

Figure 7 Velocity slices at depths of 0.4 km (left), 0.6 km (middle) and 0.8 km (right). a The true model. b The initial estimate. c The inverted model