Acceleration strategies for elastic full waveform inversion workflows in 2D and 3D

Full waveform inversion (FWI) is one of the most challenging procedures to obtain quantitative information of the subsurface. For elastic inversions, when both compressional and shear velocities have to be inverted, the algorithmic issue becomes also a computational challenge due to the high cost related to modelling elastic rather than acoustic waves. This shortcoming has been moderately mitigated by using high-performance computing to accelerate 3D elastic FWI kernels. Nevertheless, there is room in the FWI workflows for obtaining large speedups at the cost of proper grid pre-processing and data decimation techniques. In the present work, we show how by making full use of frequency-adapted grids, composite shot lists and a novel dynamic offset control strategy, we can reduce by several orders of magnitude the compute time while improving the convergence of the method in the studied cases, regardless of the forward and adjoint compute kernels used.


Introduction
The retrieval of physical properties of the Earths interior using seismic data has always been challenging for both academia and industry.As such, it has been subjected to intensive research for the last decades.One of the methods that potentially allows to extract more information from seismic data than travel-time tomography is full waveform inversion (FWI).Theoretically, it can provide models of physical parametres with higher spatial resolution than other classical methods such as travel-time tomography.In parallel, due to the increase in computational power, large-scale complex forward modelling has become affordable.
FWI is commonly formulated as an iterative process that seeks to improve the model by minimizing the differences between the synthetic traces obtained using a reference model (which can be poor in high-frequency content) with the real seismic trace recorded experimentally, hereafter referred to as data, by means of a cost functional [44].Nevertheless, and given the number of parametres to be inverted, statistical methods such as Monte Carlo are still not applicable for large 3D problems, especially when high-frequency contents are involved.In this sense, the centre piece of FWI is the so-called adjoint method, which allows us to obtain the gradient of the misfit function for the current model by cross-correlating both incident and back-propagated wavefield [15,16,31,32,44].With the gradient at hand, an iterative optimization problem can be set in order to find the model that fits best the field data.
Although conceptually simple, FWI application to field data remains challenging due to the numerous inherent problems.Extreme non-linearity, irregular non-convex misfit functions, compensation for geometrical spreading, insufficient knowledge of the physics and inadequate starting models together with low-frequency limitations in the data are some of the problems that must be overcome when inverting a seismic dataset.This explains the sparking interest for more complex and accurate inversion methods.Additional mathematical insights such as preconditioning and regularization also play an important role in reducing the non-linearity of the problem and are often required in order to ensure convergence [3,12,20].On the other hand, FWI becomes very challenging as at least two parametres must be determined either independently or using constrains, and hence, the parametre space becomes larger and more complex [7,8,46].
In addition to the algorithmic challenges of elastic FWI, sometimes referred to as EFWI, there is an inherent barrier which is the cost associated with running the scheme.In order to mitigate the computational burden, previous work has mostly focused on improving the efficiency of the computational kernel [11,13,21], which takes most of the FWI execution time.The computational kernel is responsible for simulating seismic wavefields with the purpose of measuring data misfits, generating gradients and obtaining approximations to the Hessian.Some acceleration efforts focusing on the FWI computing kernels include porting these kernels to accelerator-based hardware [5,11,39], optimizing the kernels' performance in general purpose processors [23] for off-the-shelf hardware or parallelizing the kernels in many compute units [30,36,38].Nevertheless, we wish to show here that there are workflow strategies that are orthogonal to kernel optimization and which result in notable computational savings, thus making elastic FWI a routinely applicable tool for 3D datasets.Such strategies are mostly independent on the specifics of the computed kernel such as whether it is specified in time or frequency domain, uses finite differences or finite elements or whether it is highly parallel or not.We will exemplify the most important strategies that we have explored for a set of 2D and 3D cases, and show how some of them even result in an improvement in model fit and a reduction in kernel size, both collateral benefits for an FWI application.In summary, we will apply the following strategies: (1) multi-scale grid-adapted gradient generation, (2) shot decimation by means of composite shot lists and (3) dynamic offset control (DOC) to focus on reflection events with minimal user interaction.
This paper is structured as follows: in the first section, we briefly present the adjoint method and the more important features of our algorithm.In the second section, we present our strategy for selecting a dataset and show that it allows for obtaining accurate inversion.We also illustrate the potential of our approach by including density effects in the modelled datasets.In the third section, we will apply our workflow to a 3D elastic model and discuss some computational aspects and advantages of our implementation.

Theory
We will briefly outline the general FWI algorithm and also remark which specializations have been made in our implementation, used in the examples hereafter.The mathematical formulation of FWI consists of the minimization of an error or misfit function E(m) [44,48], where m is the current model.The computation of E requires modelling accurately the response of the model m by means of a simulation.Many choices of E functionals exist, L 2 being the most common choice.However, some other norms have been successfully applied to the FWI problem, such as the L 1 norm [7,8,45,48], the Huber norm [19] or separating phase and envelope misfits in the time-frequency domain [16,24].In this study, we employ the normalized least square criterion L 2 given by where N is the number of receivers, n the number of samples of each trace, x r and x s are the receiver and source position respectively, u(x r , x s ; m k ) the components of the displacement vector obtained from the current model parametres m k at the k th iteration and u(x r , x s ) stands for this recorded in the seismic experiment.A misfit function such as Eq. 1 is typically an irregular surface, and hence, finding a global minimum can be challenging if not impossible.In this context, multi-scale techniques [4,9,32,35,40,48], combined with gradient preconditioning and regularization methods, have been developed to sequentially incorporate higher wavenumber information into the inverted model.This results in a reduction of local minima effects by means of either selecting increasing individual frequency samples (in frequency domain) or broadening a low-pass filter by increasing its corner frequency (in time domain).
We represent seismic wavefields with the elastic wave equation and use a time-domain approach to simulate synthetic wavefields.The equation reads where f s is the source function at position x s , v the particle velocity, ρ the density and σ the stress field, which is related to the strains by σ = C : ε ε ε being C the stiffness tensor which defines the elastic properties of the model.Notice that for the isotropic case, the model has only three free parametres: λ, μ and ρ.Our seismic modelling scheme is an eighth order in space and second order in time staggeredgrid finite-difference scheme in time domain [28,47] that uses a mimetic approach to accurately solve the free-surface condition and hence allows for a less restrictive grid spacing criterion in the computations [33].For the bottom and lateral boundaries, we use sponge zones as reflection-free condition [41].
The model gradient associated with the misfit function E of each source or shot is constructed by cross-correlation of a forward and adjoint wavefield, obtained from an adjoint source as explained by [14].Using the strains of both forward (ε ε ε) and adjoint (ε ε ε † ) wavefields, the kernels for ρ, λ and μ are defined by where v and v † stand for the forward and adjoint particle velocity fields, respectively, and T the simulation's duration.The search direction P k at the k th iteration is obtained by means of a non-linear conjugate gradient method and is given by where a Polak-Ribière criterion is used to obtain β k .The last stage consists in a line search algorithm that finds a satisfactory α k and then updates the models according to Nevertheless, using gradients obtained directly from the kernel results in a slowly converging algorithm; thus, we apply a two-step preconditioning: At a first stage, we apply a change of variables and choose to work with m=log e (m) instead of m as explained by [25].In all the numerical tests, we observed that using log e leads to faster convergence, reducing the number of iterations per frequency.The second step compensates for geometrical spreading by means of dividing the gradients by the square of the forward illumination; this combination has shown to be sufficient in all the tests we performed.The complete elastic FWI workflow is shown in Fig. 1 and is rather general for FWI applications.Notice that the scheme typically involves many shots, and hence, it is beneficial to apply parallelization techniques at the shot level, as indicated in the grey boxes of Fig. 1.Particular to our own FWI implementation, we wish to remark that in the following examples, the computational grid does not match the receiver or source grids at any given frequency band.In addition, we have not fixed any part of the model neither have we preprocessed the synthetic data to select particular phases.In order to avoid near-field effects, we typically remove the nearest offset traces for each shot (typically offsets < 150 m).This helps reducing the source footprint in the gradients.

Workflow optimization overview
Three main optimizations are suggested at the workflow level, which affect the performance and quality of elastic FWI.Before analyzing in detail their behaviour, we proceed to outline their main traits.
-Frequency-adapted grids: Full wavefield seismic modelling algorithms are limited by resolution constraints, which are determined by the dispersion properties of the modelling algorithm.As a consequence, the grid spacing Δ needed to simulate wavefields accurately is proportional to the minimum wave velocity in the model and inversely proportional to the maximum frequency modelled.The lower bound to the inverse proportionality constant is called points per wavelength or ppw and is constant through all practical frequencies.Notice that ppw is determined by the actual algorithm and the accuracy requirements set by its user.As the ppw is a lower bound, the standard approach is setting the grid spacing Δ to the minimum required by the highest inverted frequency and minimum velocity, thus setting a constant grid for the complete FWI procedure.Nevertheless, small Δ values result in many grid points and thus expensive simulations.Furthermore, the minimum velocity value might vary due to model updates during the FWI iterations.We propose adapting Δ to the optimum value at each inverted frequency, thus regridding the model to suit the new sampling for modelling [27].In a typical multi-scale FWI [9,32], this means large computational savings at the lowest frequencies inverted but no benefit at the highest frequencies, which in turn lead the cost count of the complete FWI procedure.In our case, we use slowness and density trilinear interpolation in a Cartesian grid to move between scales, but different modelling algorithms might favour different regridding options.This reduces the computing time required to obtain a model update, compared to using the maximum frequency grid spacing at each iteration.
-DOC: This strategy is a data decimation approach which limits the maximum offset used at each shot.
The technique was originally designed so that surface wave effects do not become dominant in the gradient, so that reflection modes keep updating the model in elastic FWI.This does not mean that we neglect surface waves, but their overall impact on the gradient is reduced.As a data decimation approach, DOC requires additional validation of its impact on the resulting models compared to full data approaches.Such validation will be shown in the next section.Incidentally, offset limitation has a strong impact on performance, as the size of each shot resolved is much smaller than the original full-offset shot.Notice that this might end in dissimilarly sized shot domains and, hence, parallel imbalance unless master-worker strategies as those described above can be used.This technique is also particularly well suited to master-worker FWI applications [22,37], where a different amount of computational resources can be allocated to each worker.Furthermore, the parallel efficiency and load balance of such schemes are excellent whenever the number of tasks (e.g.shots) is much larger than the amount of available workers [2].
-Composite shot lists: This last strategy [27,49] simply distributes the complete shot list in a subset of shots which are run sequentially at each gradient iteration.In this way, for a fixed number of gradient iterations, the total amount of shots simulated is smaller, reducing the computational resources needed.Additionally, no cross-talk problems appear such contrary to source encoding techniques [3,26].Furthermore, composite shot lists can be combined with the DOC strategy described above and virtually any acquisition geometry can be used.Last but not the least, as in any other multi-shooting strategies, depending on data redundancy, the savings from composite shot lists might be large and apply to all shot loops in the algorithm.
In Fig. 1, we can identify where each of the strategies described above affects the standard elastic FWI workflow.Each strategy is constrained to a particular set of workflow lines and aims at either reducing the size or the amount of the computing kernels that need to be run.In the next sections, we will exemplify the strategies with 2D and 3D models to illustrate both the quality and speedups obtained with respect to standard workflows, using exactly the same computational kernel in all cases.
3 Workflow evaluation in idealized scenarios

Offset range and inversion quality
In the following, we present the inversion of a 2D model based upon an x-line section of the 3D SEG/EAGE overthrust model [1].The starting model is presented in Fig. 2 bottom.The acquisition geometry includes sources with a spacing of 100 m starting from x = 1050 m, resulting in a total of 151 shots.Receivers have the same spacing but start 50 m from the source, i.e. in a roll-along acquisition ending in roll-on/roll-off near the first and last shot locations.We also have included a free-surface condition at the top and the total simulation time is 6 s.The source used is a vertical force (z-direction) with a Ricker wavelet having a central frequency of 10 Hz.The model dimensions are 16 × 3.2 km.For the sake of simplicity, the V s /V p ratio is constant and set equal to 0.5 and ρ is set to 1000 kg/m 3 .Figure 3 shows the multi-component data (P, v x and v z , respectively).We observe that for such elastic model, surface waves dominate the amplitudes.Nevertheless, both Lamé parametres are freely and simultaneously inverted.
Such data can be inverted using multi-scale elastic FWI using a low-pass filter with cutoff frequency at 2.5, 3.4, 5.1 and 6.4 Hz sequentially, running 20 iterations per frequency.The starting models consist in a Gaussian smoothing of the true ones (see Fig. 2 bottom) with a correlation length of 500 m as proposed in the literature [7,46] which is similar in quality to those that could be obtained by travel-time tomography.The comparison between full offset and DOC is presented in Fig. 4 (top and bottom, respectively) and the multi-grid and multi-scale parametres used for inversion are summarized in Table 1.Although the main features are present and hence the inversion for full offset seems successful, the algorithm has been unable to attain a sharper image of the model, because it has become trapped in a local minimum.Our hypothesis is that high-frequency content is strongly related to surface waves having large Fig. 2 Target and starting models for elastic inversion.From top to bottom and left to right: the V p and V s target models; V p and V s starting models amplitudes and hence taking a prevalent role in the inversion at the expense of information from deeper structures.Brossier et al. [7] suggested that such kind of data could be better inverted by applying sequential inversion to successively damped traces, in order to reduce the non-linearity of the misfit function.Alternatively, one could remove surface wave from the data and use elastic FWI with absorbing boundaries at the top of the model.In this work, our approach for elastic FWI consists in employing data as is, without damping, only constraining the maximum offset such that we decimate the data in the spatial domain.In particular, we will limit the maximum  offsets inverted by a distance proportional to the characteristic wavelength of the model.This means that the inverted offsets are smaller, the higher the corner frequency applied in our multi-scale method.Such strategy allows for a strong reduction of the surface wave footprint with respect to reflections and refractions.Furthermore, the method leads to a drastic reduction of the overall computational effort, depending on the offset selection strategy chosen.The maximum offset for the DOC method is given by max where f 0 is the cutoff frequency of the low-pass filter applied to the data, D a parameter to be defined by the user, and x s and x r stand for the source and receiver positions, respectively.The example above uses D = 2V P .In Fig. 5, we present the velocity maps of the resulting models obtained with DOC after a total of only 80 iterations.We observe very detailed structures at shallow depths; the two dipped structures at ranges 5 to 10 km at the bottom are well recovered as well.The small channel at range 6 km and depth 2 km is also correctly retrieved by our algorithm.Curiously, the V p model is not as sharply recovered as the one presented in the work of Brossier et al. [7], likely due to using directional force sources instead of explosions and a penalization of direct waves inherent to offset limitation associated with DOC. Figure 6 compares the results of our inversion with vertical logs.We observe a very good match between target and inverted velocities, especially for the shear velocity.It seems clear that the upper part is almost perfectly recovered, with very fine details due to the inclusion of free-surface effects [29].We acknowledge, nevertheless, that our inversion lost some accuracy at 12 km, seemingly because of cycle-skipping effects although the reflectors are correctly located.

The choice of parameter D
The objective of this subsection is to discuss the influence on our elastic FWI scheme of the parameter D in Eq. 6.To that goal, we perform a series of inversions of a well-known model, so that we can evaluate the model fit for different choices of D. The inversion parametres, starting models and strategy are those described in the previous section.We use a set of D values and evaluate the quality of the resulting inversions by means of the L 1 norm of the V P and V S obtained models; data misfit is harder to compare because inversions at each D value involve different amount of data.The range of D used is V s /2, V s , V p , 2V p and ∞. Figure 7 displays errors at the end of each frequency inversion for each D value.It is clear that full offset is the worst option, already from the first frequency band.We observe that reducing D leads to more accurate inverted models, although there appears to be a limit beyond which reducing D results in worse inversion results.The best overall result is obtained with D = V P , although the range D ∈ [0.5V P , 2V P ] seems satisfactory.In fact, there appears to be a different behaviour at low and high frequencies, with smaller D values preferred at low frequencies and slightly

Final
Fig. 7 Inverted V p and V s mean relative error with respect to the true velocity models: starting model, first, second, third and fourth frequency in green, red, blue, magenta and black, respectively higher values for higher frequencies.This suggests that the optimal D value might depend on frequency.Nevertheless, the best value of D in real applications might be obtained by QC for a range of D at the lowest frequencies and keeping D constant throughout the frequency range.We remark that DOC results in drastic reduction of the computational needs.If combined with frequency-adapted grid spacing and model slicing, we could obtain gradients of each shot with the least possible computational resources.In our case, shots cover the whole receiver span plus a padding region of 1500 m at both sides and absorbing boundaries.In Fig. 8, we plot the computational savings at each frequency for the 2D case previously depicted and a 3D scenario shown later in this article.In this figure, we use D = V P and savings are expressed as the fraction of the compute time required for a full offset (i.e.D = ∞).Clearly, the higher  the frequency, the higher the savings in CPU time, reaching up to a ×20 faster runs at 6.4 Hz.In 3D, the case is even stronger towards using DOC.

Test cases with independent V s and V p
We conclude this section with two test cases (namely cases 1 and 2), using uncorrelated V s and V p for modelling.For this purpose, we add a low-frequency perturbation invariant in the x-direction to the shear wave velocities of Fig. 2 (target) as illustrated in Fig. 10.Density is 0.415 times the V p values.The starting models are those of the previous subsection.In case 1, we use sources in the z-direction and also include free-surface into the modelling.For case 2, we use explosive sources and model an infinite medium.Because of the missing low-frequency in the model, we can not make use of all the dataset as we do in the previous section, such that the inversion strategy differs somewhat from that presented in Table 1 for the starting frequency.First, we fail in starting at 2.5 Hz, regardless of the selected strategy or windowing.So we start to invert at 1.7 Hz.We also perform a first set of 20 iterations with only the first 3.4 s, in order to constrain our inversion to the first arrivals, which seems reasonable.In a second step, we relax the windowing and include all the arrivals.Table 2 summarizes the inversion strategy.
Figure 9 presents the inverted velocity maps for cases 1 and 2. The results confirm those of the previous subsection.We observe that the best images are obtained when using explosion sources and no free surface.Reflectors are mostly flat, with fine details in the geological structures.Even the low-velocity reservoir at range 6 km and depth 2000 m is correctly picked by our algorithm.We also observe a quite strong overshooting on the fracture zones, although the reflectors are correctly positioned.When using vertical force sources and free surface, reflectors are no longer flat as seen in the previous section.Figure 10  Detailed observations show that the V p wave velocity is better recovered for test case 2. On the other hand, shear wave velocities appear sharper when the free surface is not included.Nevertheless, in both cases, we remark that the algorithm has been able to converge efficiently towards an acceptable solution.On the other hand, we observe that the convergence is slightly smoother for case 2 than for case 1 although in both cases, data misfits are diminished to below 1 %.

Application to 3D elastic full waveform inversion
Previous studies have already presented results of 3D elastic full waveform inversion: [6,10,17,18,34,42,43,46].In order to validate our approach in 3D, the SEG/EAGE Overhrust model is used.The total model size is 16 × 16 × 3.2 km 3 .The shear velocity is taken as half of the compressional velocity and density is constant and set to 1000 kg/m 3 .The geometry acquisition consists in 79 fixed lines of 79 receivers each, resulting in 6241 fourcomponent receivers equally spaced in both the xand y-directions.This acquisition differs somewhat from that proposed in [46], in particular we use more sources but less receivers.Sources are located at the centre of each cell of four receivers (see Fig. 11) and only have a component in the z-direction (vertical forces).We used the same Ricker wavelet as described previously.The modelling algorithm is an eighth-order in space and secondorder in time, explicit time-domain, staggered-grid, finitedifferences method, which includes the free-surface reflections at the top of the model by means of mimetic operators Fig. 11 Acquisition geometry for modelling together with the composite shot list distribution used for the 3D elastic full waveform inversion  [49] and split the complete shot list in four sublists, in a staggered manner as shown in Fig. 11.This mostly leads to a similar illumination for each sublist, thus justifying the use of a non-linear conjugate gradient method for optimization, such that we can use the last search direction to update the new one.As a further acceleration strategy, we apply the DOC technique in order to reduce the domain size for forward, backward and test simulation, respectively, thus significantly reducing the computational cost, especially for the highest frequencies.As a side effect, we are no longer forced to use domain decomposition, because for all the frequencies inverted, all shots fit into a single node's memory.As mentioned before, this also reduces drastically the total amount of data used in the inversion, and consequently, the associated data processing operations such as low-pass filtering, time interpolation, adjoint source calculation or intermediate storage.In this numerical experiment, we choose to use the maximum of the shear velocity as D parametre, thus allowing for very small model sizes for each shot during inversion.The whole strategy for inversion is summarized in Table 3.The complete 3D elastic FWI run was executed in 88 nodes resulting in 3 % of the computational ressources used to simulate the dataset.Figure 12 shows the 3D volumes obtained after inversion.The overall results show a good match between target and inverted models.For both shear and compressional velocities, the fracture zone is clearly delineated, as well as the thinner layers.We also examine three random velocity profiles shown in V s .On the other hand, V p lacks some resolution, allegedly due to the missing shorter wavelengths compared to V s at the same frequency, but also because we did not carry out more iterations for computational reasons.At this point, we wish to recall the strategies which have allowed to speed up the computation of elastic 3D FWI in this case and what has been their relative contribution.This is summarized in Table 4. Due to the cost of running the FWI system without using the strategies presented in this work, in the following section, we will depict how to extrapolate such cost without actually running the whole inversion.

Discussion
In the last example, we have shown how our elastic 3D FWI-accelerated workflow is able to sustain roughly a ×100 theoretical speedup compared to a conventional 3D elastic FWI workflow.By conventional workflow, we mean no grid coarsening neither DOC nor shot lists.In Table 4, it is clear how DOC compensates for the diminishing effect of grid adaptation to frequency, thus obtaining a significant speedup for the complete frequency range of our inversion.This, however, is a theoretical number that does not account for a number of effects and consequences of our strategy, namely overheads due to model and data interpolations, gradient tapering, load unbalancing or increased boundary/main-grid ratios.We can, nevertheless, use the run time of our forward modelling for this particular test in order to see how far we are from that theoretical speedup.We can do so because both the forward modelling (FM) and FWI kernels share the same code for wave propagation, boundary conditions, source insertion and receiver data retrieval.Taking t F M 10H z the total simulation time of the modelling, we can compute the cost t convF W I 6.8H z of a conventional 3D FWI using the complete offset range but only the shot number and frequency content associated with the highest inverted band in our accelerated algorithm.The count is in which we use the total iteration count it FWI = 40 as in Table 2, the amount of simulations per gradient computation  Every speedup is computed with respect to running a conventional elastic 3D FWI workflow at the same frequencies and with the same shots and iterations.We have taken into account the additional 1500 m padding per side used in our DOC implementation Now, normalizing the compute count of our inversion vs our modelling, we find Hence, we can conclude that we are able to obtain an aggregate ×53 speedup compared to the cost of running the same inversion with a conventional 3D elastic FWI workflow, independent of the kernel architecture or its particular optimizations.This means that we are obtaining about a 51 % of the expected (theoretical) performance gain, for the reasons briefly described at the beginning of this section.
As for the result quality, arguably, using less data also results in a smaller inversion effort, although the data subtracted from the inversion dataset will not be able to improve the model.Nevertheless, in the previous sections, we have focused on explaining how, in our test cases, and without data pre-processing, such long-offset data often results in worse inverted models, most likely due to the algorithm getting stuck in a local minimum if no damping has been previously applied to traces.the continuous use of interpolating schemes in our accelerated algorithm can account for extra errors in the inversion, as does narrowing the distance to the artificial absorbing boundary conditions resulting from DOC which can introduce extra artefacts into the gradients.Finally, the composite shot list strategy somehow should result in slower convergence.We do not account for the individual effects of all of these limiting issues in the present study.Nevertheless, the final inverted models show a good enough quality to justify the vast computational improvement allowed by the workflow presented here.
It is worth paying special attention to the novel dynamic offset control technique here, based upon reducing the maximum apperture of each shot in FWI, hence reducing the computational cost when moving into higher frequencies.The technique allows for fast convergence with no data picking even in the presence of ground roll.We have exemplified the method by analyzing the effects of larger offsets on inversion results, reaching the conclusion that a suitable value for D should be max(V s ) ≤ D ≤ 2 max(V p ).Nevertheless, we remark that we lack prior ways of determining the optimal D value, which can be considered a drawback of this technique.It is important to notice that the strategies shown here are not limited to a single acquisition type, but could be used for any land or marine acquisition without fundamental differences.This is in contrast to other strategies such as encoded multi-shooting [3,26] which behaves poorly whenever no fix-spread acquisition is present.
We also remark that, in order to be able to take full advantage of the strategies presented here, it is recommended to use a workflow which can afford severe parallelism imbalances at the shot level.In our case, the choice has been a master-worker workflow where each frequency is an independent workflow itself, and pre-processing, postprocessing and kernel are the main worker roles.

Conclusions
We have presented a set of general strategies towards reducing the cost associated with 3D elastic FWI.We have validated the most novel of the strategies, called DOC, in complex 2D scenarios and concluded that it does not only result in important computational savings but also allows for obtaining better model fits.This is due to reducing the footprint of the ground roll in the data and using gradients which are very local to the shot locations.The other acceleration techniques include grid size adaptation to frequency and composite shot lists.We remark that all of these strategies are independent of each other but can also be applied simultaneously for maximum gain.It is also important to notice that the improvements are orthogonal to any optimization carried out at the level of the numerical kernel used to solve the wave equation for forward and adjoint fields, but works better when the workflow is flexible enough to accommodate differently sized shots, such as in a masterworker pattern that separates individual shots in different worker tasks.Our example shows that, rather than the theoretical FWI vs modelling 1000-fold increase in cost, our scheme achieves a mere tenfold increase in cost compared to that of modelling synthetic data when both modelling and inversion cover the same frequency range.In particular, we show positive 3D elastic FWI results with a ×53 speedup compared to the time required for the inversion using a convential FWI workflow.

Fig. 1
Fig. 1 Algorithm used for full waveform inversion.The optimization and acceleration features are indicated in colour

Fig. 3
Fig. 3 Three-component shot gather obtained with a Ricker wavelet with central frequency of 10 Hz for a source located at the middle of the domain

Fig. 4
Fig.4 From top to bottom and left to right: the V p and V s inverted models with full offset; V p and V s inverted models with DOC

Fig. 5
Fig. 5 From left to right and top to bottom: V p target model, V s target model, V p inverted model, and V s inverted model obtained with the DOC strategy of Table 1

Fig. 6 V
Fig.6 V p and V s (top to bottom) profiles obtained with the DOC strategy of Table1at range 6, 9 and 12 km (left to right).The inverted, starting and target models are the blue, the dash and the red line, respectively

Fig. 8
Fig.8 Ratio of computational cost with and without DOC as a function of the offset for four frequencies in the 2D case (in blue, left axis) and in the 3D case (in red, right axis)

Fig. 9 Fig. 10
Fig. 9 Target (top) and final velocity models for V p (left) and V s (right) obtained with the DOC strategy ofTable 2 for test cases 1 (middle) and 2 (bottom)

Fig. 12 Fig. 13
Fig. 12 From left to right and top to bottom: V p target model, V s target model, V p inverted model, and V s inverted model obtained with the DOC strategy

Table 1
at range 6, 9 and 12 km (left to right).The inverted, starting and target models are the blue, the dash and the red line, respectively

Table 3
The grid size, equal to the amount of staggered grid cells used for modelling, is 440 × 2201 × 2201 with Δx = Δy = Δz = 7.25 m.We use MPI-based domain decomposition, using 20 nodes for each shot computation.Inside each node, OpenMP takes care of a multi-thread approach to maximize the use of all available cores.We run the whole simulation on MareNostrum3 1 supercomputer resulting in 2.8 TB of synthetic traces.For inversion, we use only the first 55 lines, resulting in 3621 shots.At this point, inverting directly for all the shots is too expensive; hence, we use the composite shot lists strategy proposed by Warner et al.