## 1 Introduction

Seismic velocity of subsurface layers controls wave propagation across them. Figure 1 shows the geometry of a typical seismic survey across a simple 3-layer model. Although rays emanate from the source along all directions, rays crossing layer boundaries (interfaces) must satisfy Snell’s law:

\begin{aligned} \frac{\sin \theta _1}{V_1} = \frac{\sin \theta _2}{V_2}, \end{aligned}
(1)

where $$\theta _1$$ and $$\theta _2$$ are the angles of the incident and transmitted rays, respectively, relative to the normal to the interface, while $$V_1$$ and $$V_2$$ are the velocities of the media in which the incident and transmitted rays propagate, respectively (Sheriff and Geldart 1995).

It can be shown geometrically that the two-way travel time (T) of a ray reflected off any interface is related to the offset (X) between the source and receiver on the surface through the following hyperbolic relation:

\begin{aligned} T^2(X) = T_0^2 + \frac{X^2}{V^2}, \end{aligned}
(2)

where $$T_0 = 2H/V$$ is the zero-offset two-way time and H and V are the depth and velocity, respectively, from the surface to the interface (Sheriff and Geldart 1995). The topmost layer is generally accessible for direct velocity measurement and satisfies the hyperbolic assumption of equation (2). However, deeper layers generally do not satisfy equation (2) due to ray bending. Velocities of deeper layers are found conventionally by assuming small velocity contrasts between adjacent layers so that the hyperbolic assumption can still be used despite the introduction of small errors. The most commonly used methods for velocity determination include: the $$T^2$$-$$X^2$$ and velocity spectrum.

The $$T^2$$-$$X^2$$ method depends on fitting a line to the $$T^2$$-$$X^2$$ data of each reflection and estimating the stacking velocity ($$V_S$$) from the surface to the reflecting interface. The layer’s interval velocity can be calculated from the stacking velocities and times at the top and bottom interfaces of the layer using (Dix 1955) formula. However, this method is generally not suitable for seismic data exploration because it requires picking of the travel times, which may be inaccurate when automated due to noise commonly present in seismic data and not practical if done by experienced geophysicists because it is highly time consuming (Yilmaz 2001).

The velocity spectrum method transforms the data in a common midpoint (CMP) gather from the T-X domain to the $$T_0$$-$$V_S$$ domain. Without picking, a range of stacking velocities is fitted to the T-X data and a goodness-of-fit parameter (e.g., semblance) is calculated at each $$T_0$$. The stacking velocity corresponding to the largest semblance value indicates a reflection from an interface at this $$T_0$$ with this stacking velocity (Taner and Koehler 1969). Once stacking velocities to the top and bottom interfaces of each layer are estimated, Dix formula can be used to calculate the layer’s velocity. Although this method avoids picking, it requires the hyperbolic assumption, which might not be satisfied especially in near-surface layers. Near-surface layers generally exhibit high vertical velocity contrast due to weathering causing rays to bend sharply when crossing an interface. Therefore, the hyperbolic assumption generally fails and more accurate methods that honor ray bending are required.

In an alternative approach, physical systems of the form in Fig. 1 can be represented by mathematical objects called graphs, from which properties of the underlying model can be studied using tools of graph theory. One main goal in spectral graph theory is to deduce principal properties and structure of a graph from its spectrum. See for example (Biggs et al. 1976; Tompa 1980), where eigenvalues of graphs were associated with the stability of chemical molecules. Integration of graph theory with stress testing methodologies to assess the resilience of transportation network topology under the influence of environmental hazards was discussed in Aydin et al. (2018).

The use of concepts of graph theory in the study of seismology has been discussed in a number of works. McBrearty et al. (2019), use graph clusters to study the link between arrivals, earthquakes, and source locations. Graph theory was proposed for efficient seismic tracing by Moser (1989), where seismic point sources are represented as vertices in a graph and connections between the points are assigned weights in the graph representing the travel time of seismic wave along the connection. In Ferreira et al. (2019) weighted undirected graphs were discussed as instruments to aid seismic interpretation. Tools from graph theory have also been used to identify salt dome boundaries in seismic data and to carry out seismic performance assessment of transport systems using graph theoretical concepts of underlying network by Khayer et al. (2022) and Malekloo et al. (2022), respectively. For a historical survey on the applications of graph theoretical tools in geosciences the reader may also consult the review article by Phillips et al. (2015).

In this paper, we develop a method to estimate seismic velocities using concepts of graph theory based on some spectral data of the underlying earth model. The relationships between seismic velocities and eigenvalues of weighted graphs representing the underlying earth models are explored. We develop a regression model for predicting seismic velocities from the largest eigenvalues of graphs of the underlying earth models. The use of the proposed regression model for predicting velocities of previously unseen data is demonstrated and the percentage error of the predictions is discussed as a metric for the model.

This paper is organized as follows. Some preliminary concepts from graph theory are discussed in Sect. 2. In Sect. 3 we discuss the graph theoretic abstraction of the underlying earth model. The data generation approach and regression model are presented in Sect. 4. In Sect. 5 we test the proposed method on synthetic seismic data and some concluding remarks are given at the end of the paper.

## 2 Preliminaries

In this section, we recall some important concepts and terminology which are of fundamental use in our mathematical model. More details on the notions presented here can be found in Brouwer and Haemers (2011), Chung (1997), and Trudeau (2017)

### Definition 1

A graph G is a triple consisting of a vertex set V(G), an edge set E(G), and a relation that associates with each edge two vertices (not necessarily distinct) called its endpoints.

Two vertices connected by an edge are said to be adjacent and they are referred to as neighbors. Let a and b be any two vertices of a graph, if a and b are neighbors, we use the notation $$a \sim b$$ to indicate the adjacency of a and b. In addition, the edge between vertices a and b is denoted ab. A graph is said to be directed if all or some of its edges have directions. Such graphs are referred to as directed graphs (Digraphs). A graph with no directed edge is said to be undirected.

### Definition 2

A weighted graph is one in which every edge is assigned a numerical value called weight.

### Definition 3

A loop is an edge whose endpoints are equal. Multiple edges are those having same endpoints.

A graph with no loops or multiple edges is called a simple graph. In this paper, we consider simple undirected weighted graphs. For simplicity we shall suppress the use of some of these terminologies and simply refer to our graph models as weighted graphs. Next we discuss two important matrix representations of graphs as well as their spectra.

Let G be a graph with n vertices. The adjacency matrix A(G) and weighted adjacency matrix W(G) are $$n\times n$$ matrices whose entries are specified, respectively, as

\begin{aligned} a_{ij} = {\left\{ \begin{array}{ll} 1, &{} \text { if } i \sim j, \\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}
(3)

and

\begin{aligned} w_{ij} = {\left\{ \begin{array}{ll} n_{ij}, &{} \text { if } i \sim j, \\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}
(4)

where $$n_{ij}$$ is the numerical weight assigned to edge connecting the vertices represented by rows i and j.

### Definition 4

The eigenvalues of a graph are the roots of the characteristic polynomial of its adjacency matrix. All the distinct eigenvalues and their multiplicities give the spectrum of the graph.

Thus, if a graph has k distinct eigenvalues $$\lambda _i$$, $$i = 1, 2, \ldots , k$$, each with multiplicity $$m_i$$, then the spectrum of the graph is given by

\begin{aligned} \begin{pmatrix} \lambda _1 &{} \lambda _2 &{} \dots &{} \lambda _k \\ m_1 &{} m_2 &{} \dots &{} m_k \end{pmatrix}. \end{aligned}
(5)

### Definition 5

The spectral radius of a square matrix is the largest absolute value of its eigenvalues.

For a graph, we can talk about the spectral radius of its matrix representations, for instance, spectral radius of the adjacency or of the weighted adjacency matrix. However, in this paper we shall focus mainly on the weighted adjacency matrix. As such, all through the paper the term spectral radius will be reserved for the weighted adjacency matrix.

To illustrate some of the above notions, consider the graph G in Fig. 2. The vertex and edge sets are $$\{\text {a,b,c,d}\}$$ and $$\{\text {ab, ac, ad}\}$$, respectively. So we say vertex $$\text {a}$$ is a neighbor to vertices $$\text {b}$$, $$\text {c}$$ and $$\text {d}$$, because it has an edge to each of them. The edges $$\text {ab}$$, $$\text {ac}$$, $$\text {ad}$$ are assigned weights 5, 2, 2.5 units, respectively. The edges are undirected and the graph has no loops and no multiple edges.

Following the definition of the adjacency matrix, A(G), and weight adjacency matrix, W(G), given above, we have

\begin{aligned} A(G) = \begin{pmatrix} 0 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \end{pmatrix} \end{aligned}

and

\begin{aligned} W(G) = \begin{pmatrix} 0 &{}\quad 5 &{}\quad 2 &{}\quad 2.5 \\ 5 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 2 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 2.5 &{}\quad 0 &{}\quad 0 &{}\quad 0 \end{pmatrix}. \end{aligned}

The eigenvalues of A(G) are $$[ 1.73, -1.73, 0., 0.]$$ and those of W(G) are $$[ 5.94, -5.94, 0., 0.]$$. Thus, their respective spectra are

\begin{aligned} \begin{pmatrix} 1.73&{}\quad 0 &{}\quad -1.73 \\ 1 &{}\quad 2 &{}\quad 1 \end{pmatrix}, \text { and } \begin{pmatrix} 5.94&{}\quad 0 &{}\quad -5.94 \\ 1 &{}\quad 2 &{}\quad 1 \end{pmatrix}, \end{aligned}

and the spectral radius of the graph is 5.94.

## 3 The graph theory approach for seismic velocity determination

The system under consideration consists of seismic sources which transmit signals through reflection and transmission points located on interfaces between subsurface layers. The signals are recorded at the ground surface by receivers (i.e., geophones). This can be thought of as a set of points being connected by lines (i.e., rays).

In this section, we describe the mathematical modeling of the physical problem using weighted graphs. We represent the sources, receivers, and reflection/transmission points as vertices and the signals (rays) between them are represented as edges. Thus, there is an edge between two vertices if and only if there is a ray between them. In the physical model, the distance between any two vertices is controlled by the seismic velocity. As such, our model is built to incorporate the physical distance between any two connected vertices by assigning it as a weight to their corresponding edge. Thus, the resulting model is an undirected weighted graph. Figure 2 can be considered as the graph model of a seismic network with four vertices. Vertex $$\text {a}$$ is 5 (distance units) away from $$\text {b}$$, 2 (distance units) away from $$\text {c}$$ and 2.5 (distance units) away from $$\text {d}$$ as shown in the weights of the connecting edges. No edge between any of vertices $$\text {b}$$, $$\text {c}$$ and $$\text {d}$$ indicates that there is no direct ray between them.

## 4 Data generation and linear regression model

We now give a description of how the data used to build our model was generated, after which we discuss the predictive model itself.

### 4.1 Data generation

In the physical model we rearrange the seismic data into common mid-point (CMP) gather as shown in Fig. 1. Here, we are searching for a raypath that initiated at the source point (S), reflected at the midpoint (m), and received at the receiver point (R). This raypath must satisfy Snell’s law (Eq. (1)). That is, we are looking for the incident angle ($$\theta _1$$) that satisfies the CMP raypath and, at the same time, honors Snell’s law. Equation (6) is a rearrangement of Eq. (1) to find the value of $$\theta _1$$

\begin{aligned} \sin (\theta _1)= \frac{V_1}{V_2} \sin (\theta _2). \end{aligned}
(6)

As shown in Eq. (6), the value of $$\theta _1$$, which defines the shape of the raypath, is a function of both $$V_1$$ and $$V_2$$.

Our major interest is to generate data which can facilitate the prediction/calculation of the seismic velocity $$V_2$$ of the second (deeper) layer. For this purpose, we consider a simulation of the physical system to generate synthetic values of the velocity. In the simulation, a value of the velocity, $$V_1$$, of the first (upper) layer is assumed. In addition, distance between sources and receivers as well as layer thicknesses are assigned. Then, a range of values for $$V_2$$ is assumed. These values start at $$V_1 + 10\%V_1$$, as commonly expected in near-surface layers (e.g., dry over saturated soils, weathered layer over bedrock, etc.), and increases uniformly up to multiples of $$V_1$$. These obtained velocities are then considered as reference values which we later attempt to predict.

To achieve our goal of building a predictive model for the seismic velocity $$V_2$$, we also require to have at least one prediction variable (predictor). For this purpose, we turn to the weighted graph model used to abstract the physical system being simulated. In the graph model, the weight of any edge corresponds to the physical distance between the two vertices along the ray represented by the edge. The model is constructed as described in Sect. 3. Following the construction of the weighted graph, we investigate spectral properties of its weighted adjacency matrix.

### 4.2 The regression model

We begin with a brief description of our data exploration activities. To identify a suitable regression model for estimating the velocities, we first explore the relationship between the values of $$V_2$$ and the spectral data (the different eigenvalues and their multiplicities) of the graphs corresponding to the earth models for which data was generated in Sect. 4.1. As expected, it is observed that the spectral radius has the most significant relationship with $$V_2$$. Figure 3 shows that there is a polynomial relationship between the velocity at the second layer ($$V_2$$) and the spectral radius of the graphs representing the underlying earth models.

Having identified the spectral radius as a possible predictor for the seismic velocity $$V_2$$, we need to find a suitable predictive model. Given that we have an estimation problem with numerical predictor and response variable, we are motivated to study the possibility of a regression relationship. Our exploratory analysis revealed that a quartic polynomial regression gives better data fitting and generalization (i.e., estimation of data not previously used to train the regression model) than its quadratic and cubic counterparts. Consequently, we propose the quartic polynomial regression for computing an estimate $${\hat{v}}_2$$ of $$V_2$$:

\begin{aligned} {\hat{v}}_2 = \beta _0 + \beta _1 x + \beta _2 x^2 + \beta _3 x^3 + \beta _4 x^4, \end{aligned}
(7)

where x denotes the spectral radius of the weighted graph representation of the underlying earth model and $$\beta _0, \beta _1, \beta _2, \beta _3$$ and $$\beta _4$$ are the model parameters to be learned from available data.

The problem of finding the values of the parameters $$\beta _0, \beta _1, \beta _2, \beta _3$$ and $$\beta _4$$, is a discrete inverse problem for which we need some input and output data x and $$V_2$$, respectively. Using the approach discussed in Sect. 4.1, two hundred and ninety one (291) models were generated containing ray path matrices, and their corresponding values of $$V_2$$ with $$V_1 = 500\text {m/s}$$, among other information. In particular, the models generated have the values of $$V_2$$ (in $$\text {m/s}$$) from $$V_2 = 550$$, to $$V_2 = 2000$$ with a $$5\text {m/s}$$ increment, that is $$\{550, 555, 560, \ldots , 2000\}$$. Seventy percent (selected randomly) of these generated data are used for training the model while the remaining thirty percent are used for testing purposes.

To recover the model parameters in the regression model (7) from the training dataset, we form an equation for each entry in the dataset by putting the value of x and the actual value of $$V_2$$ in the regression equation. If we have a dataset of size l, then we end up with a system of equations given in the matrix form

\begin{aligned} {\textbf{A}} {\textbf{b}} = {\textbf{v}}_2, \end{aligned}
(8)

where

\begin{aligned} A =\begin{pmatrix} 1 &{}\quad x_1 &{}\quad x_1^2 &{}\quad x_1^3 &{}\quad x_1^4 \\ 1 &{}\quad x_2 &{}\quad x_2^2 &{}\quad x_2^3 &{}\quad x_2^4 \\ 1 &{}\quad x_3 &{}\quad x_3^2 &{}\quad x_3^3 &{}\quad x_3^4 \\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ 1 &{}\quad x_l &{}\quad x_l^2 &{}\quad x_l^3 &{}\quad x_l^4 \end{pmatrix}, {\textbf{b}} = \begin{pmatrix} \beta _0 \\ \beta _1 \\ \beta _2 \\ \beta _3 \\ \beta _4 \end{pmatrix}, \quad {\textbf{v}}_2 = \begin{pmatrix} v_{2,1} \\ v_{2,2} \\ v_{2,3} \\ \vdots \\ v_{2,l} \end{pmatrix}, \end{aligned}

$$x_i$$ is the i-th entry of x (the i-th spectral radius) in the training dataset and $$v_{2,i}$$ is the i-th value of $$V_2$$ in the training dataset. Solving the above system by least squares, we get the best possible approximation of the model parameters for fitting the given data. The values recovered are

\begin{aligned} \beta _0= & {} 3373502.64975747, \quad \beta _1 = -3.81152576e+04, \beta _2 = 1.61376620e+02, \\ \beta _3= & {} -3.03449337e-01, \quad \beta _4 = 2.13859555e-04. \end{aligned}

The whole dataset as well as the training and testing partitions can be found on github, see the data availability statement for the URL. In addition, there are two zipped folders. One containing the generated models in.mat format and the other containing the extracted ray path matrices and velocities in an Excel format.

In our computations the percentage error (residual), denoted $$P_e$$, in using the regression model (7) to estimate the velocity $$V_2$$ is computed by

\begin{aligned} P_e = \frac{|V_2 - {\hat{v}}_2|}{|V_2|} \times 100. \end{aligned}
(9)

In Fig. 4, we present the results obtained when the regression equation (7) is applied to the data used for training the model (learning the parameters of the regression model). The plot on the left shows the matching of the target and estimated values while plot on the right shows the percentage error (9). It is seen from the figures that the estimates obtained from the regression model match the true values of $$V_2$$ in the training data and the observed percentage error is largely smaller than $$4\%$$.

In Fig. 5, we present the results obtained by applying the regression model to another dataset, different from the one used to learn the parameters of the model. Such dataset is called test data. It is seen that model also performs well in estimating the target ($$V_2$$) in the previously unseen data. The plot on the left shows that the estimations obtained from the model match the target values reasonably well. And the plot on the right reveals that percentage error in the estimation is also largely below $$4\%$$. These observations highlights the generalization tendency of the model which testifies to the ability of the model to be useful for future predictions.

## 5 Model validation

A synthetic seismic data is generated to test the proposed method, where the true velocity model is consisting of 3 layers. The velocities of the 1st, 2nd, and 3rd layers are 500 $$\text {m}/\text {s}$$, 1500 $$\text {m}/\text {s}$$, and 2200 $$\text {m}/\text {s}$$, respectively, while the thickness of the first and second layers are 30 $$\text {m}$$ and 50 $$\text {m}$$, respectively. Three common shot gathers are generated at offsets 0, 100, and 200 $$\text {m}$$ located at the ground surface. Then the CMP gather located at offset 250 $$\text {m}$$ is extracted, using the three receivers at offsets 300, 400, and 500 $$\text {m}$$ (Fig. 6).

To process the data, we calculated both $$V_1$$ and thickness of the first layer from the recorded shot gathers, then we applied the graph theory on the raypath model. Here, source points $$\text {S}_1$$ to $$\text {S}_3$$, receiver points $$\text {R}_1$$ to $$\text {R}_3$$, transmission points $$\text {r}_1$$ to $$\text {r}_6$$, and reflection point M are considered as vertices, while the raypaths $$\text {S}_1$$-$$\text {r}_1$$, $$\text {r}_1$$- M, M-$$\text {r}_6$$, etc. are considered as edges. The location of the vertices $$S_1$$ to $$S_3$$, $$R_1$$ to $$R_3$$, and $$\text {M}$$ are fixed, while the location of the vertices $$\text {r}_1$$ to $$\text {r}_6$$ and M depends on the velocities $$V_1$$ and $$V_2$$ and thicknesses $$\text {H}_1$$ and $$\text {H}_2$$. Since we usually know the value of $$V_1$$ from the direct-wave data, graph theory will be used to find $$V_2$$.

In a real situation, the locations of $$\text {r}_1$$ to $$\text {r}_6$$ are unknown, so we assume all possible locations taking into consideration the following

1. (1)

Offset values of $$\text {r}_1$$ to $$\text {r}_3$$ are less than offset value of M.

2. (2)

Offset values of $$\text {r}_4$$ to $$\text {r}_6$$ are larger than offset value of M.

3. (3)

$$\text {r}_1$$ to $$\text {r}_6$$ are arranged in ascending offset, (i.e. offset of $$\text {r}_2$$ is larger than offset of $$\text {r}_1$$, etc).

Following this procedure, we generated a set of $$V_2$$ values using the graph theory approach by calculating the spectral data for each attempted model and using our regression model (7) to estimate the corresponding $$V_2$$ value. We find the correct $$V_2$$ value by calculating the total travel times from $$\text {S}_1$$ to $$\text {R}_1$$, $$\text {S}_2$$ to $$\text {R}_2$$, and $$\text {S}_3$$ to $$\text {R}_3$$ for each attempted model. The calculated times are then compared to their corresponding $$\text {S}_1$$ to $$\text {R}_1$$, $$\text {S}_2$$ to $$\text {R}_2$$, and $$\text {S}_3$$ to $$\text {R}_3$$ travel times measured from the data. The model with the minimum error between the calculated and measured travel times corresponds to the correct $$V_2$$ value.

To validate the proposed technique, we generated three common shot gathers representing $$\text {S}_1$$, $$\text {S}_2$$, and $$\text {S}_3$$ shown in Fig. 6. The velocity model shown in Fig. 6 is used as the true velocity model. Figure 7 shows one shot gather example of the data generated using a finite-difference solution to the acoustic wave equation. Figure 8 shows the common mid-point extracted from the generated three shot gathers to test the graph theory approach. The value of $$V_2$$ calculated using the graph theory approach is 1509 $$\text {m}/\text {s}$$ while the true value is 1500 $$\text {m}/\text {s}$$, which shows an error of only 0.6%. This result reflects the accuracy of the proposed graph theory approach to find seismic velocities in similar near-surface settings.
We presented a novel approach for estimating seismic velocities in common near-surface settings using graph theory. The proposed approach starts by defining the locations of sources, receivers, transmission and reflection points as vertices and ray segments between them as edges. Moreover, the coordinates (depths and offsets) of these vertices are used to calculate weights for the edges. This construction results in an undirected weighted graph model for which spectral data can be calculated. Then we performed some exploratory data analysis on a common near-surface setting to uncover possible relationships between the spectral data and seismic velocities. Based on empirical evidence, it was observed that the largest eigenvalues are the most suitable predictors for the seismic velocity. In fact, these eigenvalues were observed to be approximately some quartic polynomial of the velocity $$V_2$$. Furthermore, we test the proposed on synthetic seismic data involving a simple near-surface velocity model. The proposed method was able to predict the velocity of the second layer with a high accuracy of $$99.4\%$$, which testifies to its suitability to estimate subsurface seismic velocities accurately.