1 Introduction

Ground penetrating radar (GPR) systems are popular for detecting and identifying various types of buried objects underground as well as for structural analysis of materials and mapping applications [1,2,3]. GPRs can be operated manually by a human or can be deployed on host platforms. Recent developments in the Size, Weight and Power (SWaP) aspects of hardware components and the increasing sophistication of associated software systems have led to the proliferation of autonomous or semi-autonomous host platforms. In particular, the widespread use of rotary UAVs with increasing payload capacity is likely to introduce new opportunities for combining agile airborne host platforms with GPRs as their payloads, an attractive combination for addressing challenges concerning effective coverage of large areas.

Fig. 1. A typical operation sequence using UAV-mounted GPR

As illustrated in Fig. 1, a typical GPR operation sequence for object detection applications involves a series of processing steps, mainly including the generation of A-Scans (i.e., 1D signal trains) on reception of the transmitted signal, the composition of a 2D representation (called a B-Scan) out of those A-Scans along the moving direction, and further signal processing steps to obtain a cleaner signature for target detection and identification. The depth of the buried targets, the material they are made of and the electrical properties of the environment they are buried in are all factors influencing the performance of the GPR. One important aspect is that radar transmission and reception are directed toward and from the ground, and the range in that direction is usually quite limited, especially when resolution is important and thus the operating frequency is high.

One of the main requirements in typical mission planning using airborne host platforms, on the other hand, is the specification of an optimal search pattern. Although for small areas and for areas with anticipated object layouts this may be less of an issue, for large areas, such as those indicated in Fig. 2, with unknown sparse object layouts, this can be a major target for optimization. The typical approach to such planning problems is to devise a coverage path that aims to span the mission area with a certain resolution [4,5,6,7]. The central focus of the solutions provided for area sweep applications is to devise an efficient path that ensures full coverage of the area while avoiding any obstacles. Furthermore, the typical functions of the onboard sensors, in conjunction with path planning and navigation of the host platforms, are obstacle avoidance, area mapping and situational awareness. We argue that, although finding buried objects using an airborne GPR system in a large area can be viewed as a Coverage Path Planning (CPP) problem, there are certain aspects that distinguish it from a typical CPP setting:

  • As indicated in Fig. 2, the buried objects are normally quite sparsely distributed across the search area in question, so the main focus is to efficiently detect such objects rather than ensuring that the entire area is visited at high resolution.

  • The onboard sensor (i.e., GPR) of the search platform makes it possible to sense preliminary signal levels indicating that an object may be in the vicinity of the current position. This is a slightly different setting that calls for a more active role for the sensor, as a source of input for path generation algorithms.

  • Once an indication of a possible object discovery is encountered, a more detailed scan procedure should be initiated to generate proper signal patterns for robust detection and identification (if possible) of the objects. Since the performance of the GPR is tightly coupled to the quality of the target signature formation process, a more strictly controlled target detection process is likely to introduce rigor into the overall operation.

Fig. 2. An illustration of typical mission requirements

To put the problem into a proper perspective based on the points listed above, it helps to view the whole operational procedure as two distinct phases: the first phase involves a somewhat high-level search, followed by a second phase, namely the target detection phase, which is characterized by a more precise scan pattern. This two-phase cycle would normally repeat itself for each object being detected (or for a group of objects if they are closely located). Note that the first phase is more akin to a conventional CPP problem setting, except that a coarser search pattern, whose resolution can be adjusted based on the detection range of the GPR, is preferred. The second phase, then, would be driven by the GPR to generate a movement path that should result in successful detection and, hopefully, identification of the targets in question. We note that, for such sensor-based path planning and/or target detection problems, machine learning-based methods have been on the agenda recently. In particular, the reinforcement learning (RL) approach appears to be gaining popularity as an adaptive optimization methodology [8,9,10,11,12].

As such, in this paper, we introduce a method that aims to illustrate the viability of the approach outlined above, using a high fidelity simulation environment. The central idea of our method is that for areas with sparse object layouts, the host platform navigates the area with a relatively coarse pattern (compared to typical coverage path planning approaches such as grid-based ones), and when the signal processing pipeline of the GPR triggers a signal level above a certain noise threshold optimized to a certain application area, the host platform switches to a mode where navigation commands are received from an agent model trained using reinforcement learning. The work presented in this paper is an extended and improved version of our previous work reported in [13]. Our contributions in this paper are as follows:

  1. An efficient coarse-grained search pattern for the first phase mentioned above.

  2. An RL framework with a novel training approach to implement the optimization of the second phase mentioned above.

  3. A comprehensive methodology illustrating the seamless combination of both approaches for a typical GPR-based search mission.

  4. Implementation of a full-fledged experiment framework that includes an extensible wrapper for a high fidelity GPR simulator (namely GprMax), a complete RL environment emulating the relevant signal processing and decision logic of a typical GPR, a host-platform simulator wrapper for a rotary-wing autonomous drone, a buried object layout generator for mission areas of arbitrary sizes and layouts, and a convenient set of tools for logging, visualizing and plotting the results.

  5. Demonstration of the viability of our approach via extensive experimentation.

  6. Validation of the improvements of our method over a typical coverage path planning approach.

  7. A time complexity analysis of our hybrid algorithm illustrating that the method has linear time complexity with respect to the growth in area size and the number of targets to be discovered.

In a nutshell, our solution is applicable to rotary-wing UAVs with onboard GPR systems, targets outdoor environments with sparsely distributed buried objects to be discovered, and provides a hybrid method combining a heuristics-based off-line CPP and an online optimization technique.

The rest of the paper is structured as follows: Sect. 2 provides some background on GPR signal processing, path planning and reinforcement learning, Sect. 3 introduces the details of our methodology, Sect. 4 presents results and discussion, and finally, Sect. 5 provides some concluding remarks.

2 Background and related work

2.1 Reinforcement learning

Informally, reinforcement learning can be stated as learning from interactions with an environment to achieve a goal. The learner and decision maker are called the agent. The surroundings it interacts with (i.e., everything outside the agent) are called the environment. The environment provides feedback to the agent in the form of rewards in response to the agent's actions and updates its state, which may be available to the agent fully or partially. Rewards are special numerical values that the agent strives to maximize over time through its choice of actions. The primary framework used for a formal treatment of this description is the Markov Decision Process (MDP), which has mainly been adopted by the machine learning community [14, 15].

In a set theoretical setting, an MDP is formally defined (see, for instance, [16]) as a tuple of five elements \(\left \langle S,A,T,R, \gamma \right \rangle\) where S is a finite set of states defining a discrete state space, A is a finite set of actions defining a discrete action space, T is a transition function defined as \(T: S\times A \times S \rightarrow [0,1]\), R is a reward function defined as \(R: S \times A \times S \rightarrow \mathbb {R}\) and \(\gamma \in (0, 1)\) denotes a discount factor. The transition function T and the reward function R together define the model of the MDP. Given an MDP, a policy is a computable function that outputs for each state \(s \in S\) an action \(a \in A\) (or \(a \in A(s)\)). Formally, a deterministic policy \(\pi\) is a function defined as \(\pi : S \rightarrow A\). It is also possible to define a stochastic policy as \(\pi : S \times A \rightarrow [0,1]\) such that for each state \(s \in S\), it holds that \(\pi (s,a) \ge 0\) for all \(a \in A\) and \(\sum _{a \in A} \pi (s,a) = 1\).

In our framework, we implement the agent using one of the most popular model-free, Q-Learning-based RL algorithms, namely Deep Q Network (DQN) which was developed by [17]. DQN uses neural networks as a nonlinear approximator for the Q-function. In fact, a DQN is a multi-layered neural network that maps a vector of observations, \(\omega _t \in \Omega\), to a vector of action values \(Q(\omega _t,.; \theta )\), where \(\theta\) are the parameters of the network.

Mnih et al. [17] introduced two novel mechanisms that address the instability of the standard Q-learning algorithm. The first is termed experience replay, which randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. The second involves the use of a target network, a separate neural network (architecturally identical to the value network) with parameters \(\theta ^-\) that are copied every k steps from the online network, so that then \(\theta ^{-}_t = \theta _t\), and kept fixed on all other steps. To train the DQN, experiences are first accumulated in a memory (i.e., an experience replay buffer) using \(\epsilon\)-greedy policies. In this setting, for a state \(s_t\), an action \(a_t\) is taken randomly with probability \(\epsilon _t\), and greedily with respect to the current DQN with probability \(1-\epsilon _t\). Then, once the memory is long enough (i.e., when the replay buffer size reaches a predefined value), at every time slot t, a minibatch of B experiences \(\{(s_i,a_i,r_i,s^\prime _i)\}_{i \in B_t}\) is randomly sampled from the memory. Here, \(B_t\) is a random subset of the experience indices currently available in the memory. The weights \(\theta\) of the DQN (i.e., \(Q_\theta\)) are then updated to minimize the loss function given in Eq. 1:

$$\begin{aligned} L_t\left( \theta \right) = \frac{1}{B}\sum \limits _{i \in \mathcal {B}_t} \left( r_i + \gamma \mathop {\max }\limits _a Q_{\theta ^-}\left( s^\prime _i,a \right) - Q_{\theta }\left( s_i,a_i \right) \right) ^2 \end{aligned}$$
(1)
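
To make the update in Eq. (1) concrete, the following minimal NumPy sketch evaluates the loss for a sampled minibatch. The q_online/q_target callables and the batch layout are illustrative assumptions rather than part of our framework.

```python
import numpy as np

def dqn_loss(q_online, q_target, batch, gamma=0.99):
    """Mean squared TD error over a minibatch, as in Eq. (1).

    q_online(s) and q_target(s) are assumed to return arrays of action
    values with shape (batch_size, n_actions); `batch` is a dict of
    NumPy arrays sampled from the replay buffer.
    """
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Bootstrapped target: r_i + gamma * max_a Q_{theta^-}(s'_i, a)
    target = r + gamma * q_target(s_next).max(axis=1)

    # Online estimate for the actions actually taken: Q_theta(s_i, a_i)
    predicted = q_online(s)[np.arange(len(a)), a]

    td_error = target - predicted
    return np.mean(td_error ** 2)
```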

In the original DQN, the experience replay buffer is sampled uniformly. We use a variant called prioritized experience replay, which was introduced in [18]. The key idea here is that an RL agent can learn more effectively from some transitions than from others, in particular from those that are more surprising and/or task-relevant. More specifically, this technique tends to favor transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. This prioritization can lead to a loss of diversity and can introduce bias, but these issues are alleviated with stochastic prioritization and importance sampling, respectively.
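
A minimal sketch of this proportional prioritization scheme is given below, assuming the TD error magnitudes of all stored transitions are kept in a flat array; the exponents alpha and beta and the epsilon offset are placeholder values, not the settings used in our experiments.

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Proportional prioritized sampling with importance-sampling weights."""
    priorities = (np.abs(td_errors) + eps) ** alpha   # p_i = (|delta_i| + eps)^alpha
    probs = priorities / priorities.sum()             # sampling probability P(i)
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)    # importance-sampling correction
    weights /= weights.max()                          # normalize for stability
    return idx, weights
```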

2.2 CPP using RL

Area coverage is generally formulated as sweeping a completely or partially enclosed area with a non-overlapping path executed by autonomous platforms. CPP algorithms can be online or off-line, depending on the prior knowledge of the surrounding environment [4]. CPP is generally based on global sequential point-to-point coverage, where the agent follows the route while avoiding obstacles on the given map. Many algorithms have been introduced for both online and off-line scenarios addressing indoor and outdoor settings.

The use of RL in conjunction with CPP, on the other hand, is relatively recent. To the best of our knowledge, there is no reported work that specifically aims at optimizing coverage paths for GPR-driven target detection tasks using an airborne host platform in an RL setting. Most of the research seems to focus on the CPP of indoor environments with dynamic obstacles, in particular for cleaning robots. Notably, DongKi et al. [19] develop an adaptive policy for complete CPP applications for cleaning robots, using RL on a 2D grid map. They divide the whole region to be cleaned into sub-regions by using a watershed segmentation algorithm and connect sub-regions by using a general A* algorithm. The A* algorithm determines the visiting order of the sub-regions. This part of their approach is similar to ours in that RL is employed only for efficient navigation of the sub-regions. They use a color-encoded 2D image representing the current state in each sub-region, which becomes one of the inputs to the neural network modeling the agent behavior, in addition to a numerical vector encoding the pose (position and a turn angle) of the agent. Since the goal of the optimization is only efficient coverage of sub-regions, their reward function is quite simple. This seems to bring about expanding horizons (an indication of the sparse reward problem), forcing them to resort to transfer learning techniques. The authors of [20, 21] develop an RL-based complete CPP solution for variable-morphology cleaning robots (as opposed to fixed-morphology robots) and report improvements over conventional CPP algorithms. One distinct variation in their RL scheme is that both papers use an LSTM network (a form of recurrent neural network) as part of the agent model (in addition to convolutional neural network layers used for feature extraction). This CNN-LSTM network architecture incorporates the ability to account for a short history of actions (due to the LSTM part) and is used for estimating not only the next move of the robot, but also its appropriate morphology (i.e., tiling configuration).

The work reported by Bialas et al. [22], on the other hand, resembles ours in that they aim to optimize CPP for a UAV. However, their main focus is to achieve area coverage for disaster management purposes. From a technical point of view, their observation space is a combination of two 2D images (one for a local and one for a global map), a 3D image representing a local 3D map from the agent's perspective and a numeric value indicating the movement budget of the agent. This observation space requires a relatively complex neural network architecture to handle feature extraction tasks in addition to optimal action selection. The reward function they use is relatively simple, designed to encourage smooth navigation and discourage collisions with obstacles and penetration into no-fly zones. They use an actor-critic-based algorithm to train the agent. Their solution to the sparse reward problem for large environments with complex observation spaces and a simple reward function is not documented. Similarly, Haiyang et al. [23] propose a CPP algorithm based on Deep RL to handle the search area coverage problem for UAVs equipped with Synthetic Aperture Radar (SAR) sensors (SAR-UAV). But, as opposed to our approach, they do not use the airborne sensor (i.e., SAR) as an integral part of the RL-based optimization process. Their problem formulation solely aims at efficient coverage of the area, with no intention of enhancing the target (or Object of Interest (OOI)) discovery performance of the mission via sensor-driven trajectories. They follow the popular pattern of using a 2D situational map of the agent in its environment as the observation space, accompanied by a simple reward function and a cellular-grid-based movement scheme as the action space.

Perhaps one of the most relevant works is that of Ai et al. [24], where RL is used as a method to optimize the coverage of a maritime area for search and rescue missions. In this context, the objects to be discovered and rescued are analogous to our targets. Moreover, the model employed for encoding the search space relies on a combination of parameters such as probability of containment, probability of detection, and obstacle localization information, which are obtained by applying some off-line processing techniques to eventually assemble what they call a maritime search and rescue environment model, a 2D grid directly in the feature space. In our view, this is elegant and quite similar to our observation space, except that our model uses a more refined and thus more concise feature vector, leading to a leaner architecture for the function approximator (i.e., the neural network modeling the agent behavior). Another difference is that we combine a conventional search path with an RL-driven target discovery trajectory in a hybrid scheme, whereas their entire CPP scheme is RL driven. One observation worth alluding to, in this context, is that although they employ reward shaping (a technique used to alleviate the sparse reward problem) to contend with the delayed reward problem, it is not very clear how their model performs for very large areas where drift prediction models fail to produce sufficiently dense data.

It is important to note that none of the approaches reported above involves the combination of a parametric search path with sensor-triggered and RL-driven navigation sub-paths in the way proposed by this paper.

3 Methodology: a two-phase search and target detection scheme

Our method combines a coarse search pattern with the ability of a machine learning model trained for object detection using reinforcement learning (RL), based on a concise feature set obtained from the typical signal processing pipeline of a GPR. We capitalize on the fact that GPR movement along a path generates a sequence of 1-dimensional (1D) signal trains termed A-Scans. A window of A-Scans is then combined to obtain a 2-dimensional image called a B-Scan. Using background subtraction and cross-correlation calculations over a window of appropriate size, it is possible to obtain a concise 1D signature directly in the feature space that can be used to train an agent for object detection in an RL setting. We illustrate our method via a well-known high-fidelity GPR simulation package, namely GprMax [25], combined with an implementation of a novel reinforcement learning framework based on the Deep-Q Network (DQN) algorithm [17], with an extension called Prioritized Experience Replay (PER) [18]. In the following subsections, we provide the details of the proposed method.

3.1 The design of the search pattern

Since the area under investigation for target discovery can be quite large, it will have to be considered as a set of connected polygonal Regions of Interest (ROIs). Fortunately, the CPP literature is quite rich in terms of area decomposition techniques, as reported in [26, 27], for instance, and coded solutions are also readily available, such as those given in [28]. Therefore, we assume that the whole area is initially decomposed into rectangular or trapezoidal sub-regions using a well-known approach such as trapezoidal decomposition. We also assume that top-level navigation between those sub-regions is driven via a connected graph structure obtained using a recognized optimization algorithm such as A* [27]. The novelty of our method lies in its provision of an efficient navigation approach for the coverage of, and target discovery in, the aforementioned sub-regions; therefore, we will focus on this particular part of the general solution.

Different approaches to coarse-grained, search-path-driven navigation inside those sub-regions can be adopted. We devise a sine-wave-inspired search path design on the basis that signal returns from the buried objects allow coarser navigation patterns. We adopt the simple parametric path generation scheme given below:

$$\begin{aligned} D_m&= (X,Y) \end{aligned}$$
(2)
$$\begin{aligned} A&=\frac{Y - 2\alpha }{2} \end{aligned}$$
(3)
$$\begin{aligned} \mu&= 2A-y_i+\frac{\alpha }{4}; \text {for} \ \ P\left( 0, k\right) \end{aligned}$$
(4)
$$\begin{aligned} \phi&= \frac{\text {Arcsin}\left( \frac{\mu - A}{A}\right) }{\omega }; \omega = \frac{2 \pi }{B} \end{aligned}$$
(5)
$$\begin{aligned} P(x_i)&= y_i = A \times \text {Sin}\left( \omega \left( x_i-\phi \right) \right) +A + \alpha \end{aligned}$$
(6)

where \(D_m\) is a tuple defining the mission area dimensions, \(\alpha\) is an off-set along the Y dimension of the area to ensure the path is not tangent to area borders, A determines the span of the path along the y-axis of the mission grid (akin to the amplitude of a sine wave), B is the spatial period of the path along the x-axis, \(\omega\) determines the coverage resolution of the path (akin to the frequency of a sine wave), and \(\phi\) determines the entry or start point of the path (akin to the phase of a wave).
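
The following sketch shows one way a way-point list could be generated directly from Eqs. (2)-(6); the function signature, the way-point spacing and the example values of R_x and k are illustrative assumptions, not settings prescribed by the method.

```python
import numpy as np

def sinusoidal_search_path(X, Y, B, alpha, k, step=1.0):
    """Generate way-points of the coarse search path from Eqs. (2)-(6).

    X, Y   : mission area dimensions (Eq. 2)
    B      : spatial period of the sine wave along the x-axis (Eq. 5)
    alpha  : boundary offset along the y dimension (Eq. 3)
    k      : y-coordinate of the entry point P(0, k) (Eq. 4)
    step   : spacing of the generated way-points along the x-axis
    """
    A = (Y - 2.0 * alpha) / 2.0                 # Eq. (3): amplitude
    omega = 2.0 * np.pi / B                     # Eq. (5): angular frequency
    mu = 2.0 * A - k + alpha / 4.0              # Eq. (4): entry-point term
    # Eq. (5): phase; clip the argument to the valid arcsin domain
    phi = np.arcsin(np.clip((mu - A) / A, -1.0, 1.0)) / omega

    xs = np.arange(0.0, X + step, step)
    ys = A * np.sin(omega * (xs - phi)) + A + alpha   # Eq. (6)
    return list(zip(xs, ys))

# Example: a 140 x 90 area, period tied to an assumed GPR forward range R_x,
# boundary offset alpha = R_x, and entry point k = 30.
R_x = 10.0
waypoints = sinusoidal_search_path(X=140, Y=90, B=2 * R_x, alpha=R_x, k=30)
```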

Fig. 3. Two example paths generated using the parametric equations given above

As can be observed, the search path generation logic exploits the expected range information of the GPR (which should be known by design or obtained from the data sheet of the GPR), as well as the characteristics and constraints of the mission. As such, once the dimensions of the mission area, the expected forward range, the initial entry off-set to the region and the area boundary off-set parameters are known, a coarse-grained search pattern can be generated to specify way-points for the navigation sub-system of the host platform. In other words, for a specific entry point \(P\left( 0, k\right)\), \(\phi\) can be specified using Eqs. (4) and (5), and all the subsequent path points \(P(x_i)\) can be derived using Eq. (6). Figure 3 illustrates more clearly how these parameters relate to the equations. To illustrate the leverage offered by our method over conventional grid-based coverage path approaches, we provide a visual comparison of the two in Fig. 4a, and a plot of the path length growth rates for increasing mission area sizes in Fig. 4b. Here, the path lengths are calculated using piece-wise Cartesian distances between path points such that, for a path \(S = \left \{ p_0, p_1,\ldots , p_n \right \}\) consisting of n path points, the path length \(\tau\) is defined by the following equation:

$$\begin{aligned} \tau \left( S \right) =\sum \limits _{i=1}^{n} \sqrt{ \left( {p_{i_x} - p_{{i-1}_x} } \right) ^2 + \left( {p_{i_y} - p_{{i-1}_y} } \right) ^2 } \end{aligned}$$
(7)

Paths are generated for 15 different rectangular areas by increasing the length and width of a base area of (14,090) units by 20% in each case.
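
A small helper corresponding to Eq. (7) might look as follows, operating on a list of (x, y) way-points such as those produced by the path generation sketch above.

```python
import math

def path_length(points):
    """Piece-wise Cartesian path length of a way-point list, as in Eq. (7)."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points[:-1], points[1:])
    )
```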

Fig. 4. a A visual comparison of typical grid-based coverage path and our search path, and b a plot of the growth rate of the path lengths for increasing sizes of mission areas

Here, it is assumed that, in the conventional case, the host platform does not take advantage of switching to a sensor-guided (i.e., GPR-guided) secondary path. In other words, the sensor is only used when and if it passes over object locations, while strictly following the pre-defined path. In our case, by contrast, the sensor is used as a source of navigation in the second phase of the two-phase path scheme, as well as for target detection. Therefore, with a moderate assumption (based on typical object sizes and GPR ranges), in the conventional case, depending on the antenna pattern, the path resolution must be at least half of the GPR detection range to ensure unambiguous target signatures, in contrast to twice the GPR range in our method. This is because in our method, the actual target detection phase is triggered by the signal processing pipeline of the GPR based on its detection range, and then proceeds with a specific path toward the central region of the target. This particular path is learned via reinforcement learning, based on typical signal pattern formations on approach paths to targets, and thus ensures a more proper signature construction to facilitate successful detection. This explains the expanding difference between the path length growth rates. This rate directly translates into a corresponding growth in various types of costs, such as power consumption, operational costs and time delay.

3.2 Formulation of the reinforcement learning problem

In an RL setting, an agent is trained to find an optimal (or near-optimal) policy based on the feedback provided by an environment it is interacting with. In response to the action of the agent, the environment generates a meaningful feedback signal (i.e., a reward) and an updated state, in the form of an observation vector capturing all the important information essential for correlating the reward to the action provided by the agent. Thus, agent–environment interactions proceed in the form of a loop that is expected to result in a learned optimal policy to achieve a predefined goal (i.e., target detection in our case). Figure 5 depicts a pictorial representation of the primary modules and the interactions between an agent and its environment, in the context of our problem setting. Based on these considerations, it is evident that a problem formulation in an RL setting requires the specification of three important components: the action space, the observation space and the reward signal. We proceed by specifying these components in the following sections.

Fig. 5. Agent–environment interactions for UAV path optimization problem

3.2.1 Observation space

An observation for the agent, \(\omega _t \in \Omega\), is the part of the environment state observable by the agent at time step t. \({\omega }_t\) is defined as a tuple as follows:

$$\begin{aligned} U_i&= (u_i,v_i,w_i) \end{aligned}$$
(8)
$$\begin{aligned} \beta _i&= <e_1, e_2, \ \dots , e_n>\end{aligned}$$
(9)
$$\begin{aligned} \omega _t&= < \parallel U_i \parallel , \parallel \beta _i \parallel > \end{aligned}$$
(10)

where \(U_i\) is the unit vector specifying the current direction of the GPR host platform, \({\beta }_i\) is an n-tuple representing the buffer that holds the cross-correlation values (i.e., reward signals in our case) for a time window of size n. The \(\parallel . \parallel\) symbol denotes normalization.
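
A possible construction of this observation vector is sketched below. The normalization choices are assumptions, and the window size of n = 13 is chosen for illustration so that, together with the 3 direction components, the vector matches the input size of 16 reported in Sect. 4.4.

```python
import numpy as np

def build_observation(direction, xcorr_window):
    """Assemble the observation tuple of Eq. (10).

    direction     : current 3D heading (u, v, w) of the host platform (Eq. 8)
    xcorr_window  : last n cross-correlation values (Eq. 9)
    """
    u = np.asarray(direction, dtype=float)
    u = u / (np.linalg.norm(u) + 1e-12)              # normalized unit vector

    beta = np.asarray(xcorr_window, dtype=float)
    beta = beta / (np.max(np.abs(beta)) + 1e-12)     # normalized signal window

    return np.concatenate([u, beta])                 # omega_t, fed to the DQN
```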

3.2.2 Action space

The actions in our problem setting are defined in terms of steering input, \(\alpha\), to the GPR host platform. We adopt a discrete action space which is defined as follows:

$$\begin{aligned} A=\{{\alpha }_{{1}},\ldots ,{\alpha }_{{5}}\}, \quad {\alpha }_i{\in }{[-90,\ 90]} \end{aligned}$$
(11)
$$\begin{aligned} {\alpha }_i={\alpha }_{i-1}{+45,} \end{aligned}$$
(12)

where A is the set of actions consisting of 5 elements (i.e., the size of the discrete action space is 5). The first element is equal to -90, other elements are in increasing order with an increment of 45, and the last element is equal to 90. A positive value indicates a right turn, and a negative value indicates a left turn.
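
The sketch below illustrates one way the five discrete steering actions could be applied to the heading vector of Eq. (8). The sign convention (positive angles turning right) follows the description above, while keeping the altitude component unchanged is an assumption.

```python
import numpy as np

# Discrete steering actions of Eqs. (11)-(12): five turn angles in degrees.
ACTIONS = [-90, -45, 0, 45, 90]

def apply_steering(direction, action_index):
    """Rotate the platform's horizontal heading by the selected action.

    Positive angles turn right (clockwise in the x-y plane); the vertical
    component w is left untouched.
    """
    angle = np.radians(ACTIONS[action_index])
    c, s = np.cos(angle), np.sin(angle)
    u, v, w = direction                        # (u, v, w) as in Eq. (8)
    return (c * u + s * v, -s * u + c * v, w)  # yaw only; altitude unchanged
```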

3.2.3 Reward function design

We design a reward function that seeks to provide detection signal feedback in the vicinity of a target. This signal is generated via the cross-correlation function, using the last two consecutive A-Scans of the time window as the input. The signal is the primary component of the aggregate reward function that drives the agent training, but there are other components. The aggregate reward signal, \(R_t\), generated by the environment at a specific time step t, is defined by the following equations:

$$\begin{aligned} R_{xc}&= \ \parallel c_k \parallel \\ PN&= \left\{ \begin{array}{ll} 0, &{} {\text {if}} \ \ R_{xc} < \xi \\ -10, &{} {\text {if a target is rediscovered}} \end{array} \right. \end{aligned}$$
(13)
$$\begin{aligned} R_t&= \left\{ \begin{array}{ll} -5, &{} {\text {if GPR out of area}} \\ \alpha RW_{g}, &{} {\text {if a target is found}} \\ {R}_{xc} + PN, &{} {\text {otherwise}} \end{array} \right. \end{aligned}$$
(14)

where \(R_{xc}\) is the cross-correlation value (see Eq. 15) at a specific point in the mission grid at a certain time step, and PN is the penalty given if a certain target is revisited and rediscovered. This penalty is zero (0) if the visited location has no cross-correlation value or the value is under a certain threshold level \(\xi\); \(RW_g\) is the reward for discovering a target, scaled by a coefficient \(\alpha\) that allows increasing or decreasing the target discovery reward depending on the experiment design.
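
A direct transcription of Eqs. (13)-(14) into code could look as follows; the numeric values of xi, alpha and rw_g are placeholders, not the values used in our experiments.

```python
def aggregate_reward(xcorr, out_of_area, target_found, rediscovered,
                     xi=0.05, alpha=1.0, rw_g=100.0):
    """Aggregate reward R_t of Eqs. (13)-(14).

    xi, alpha and rw_g are placeholder values for the noise threshold,
    the discovery-reward coefficient and the discovery reward.
    """
    if out_of_area:
        return -5.0                    # leaving the mission area
    if target_found:
        return alpha * rw_g            # successful target discovery
    penalty = -10.0 if (xcorr >= xi and rediscovered) else 0.0
    return xcorr + penalty             # R_xc plus re-discovery penalty PN
```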

Fig. 6. A color coded visualization of the reward function

Fig. 7. 3D representation of the reward signals for the same layout

To better illustrate how the reward signal directs the agent toward the goal, we employ a color-coded visualization scheme in 2D, as depicted in Fig. 6. The figure shows a mission area harboring 9 targets of varying sizes, materials and burial depths. The circular objects indicate the targets (or objects) to be detected by the GPR. Brighter spots in the overlaid image indicate higher reward values. This image was created using GprMax simulations, running the platform hosting the GPR antenna from left to right, where each run creates a single B-Scan containing 135 A-Scans along the x-axis. The number of B-Scans along the y-axis is 60. Figure 7 shows a more comprehensible 3D view of the reward grid, revealing the relative strength of the reward signals corresponding to targets in the area.

3.3 Target detection phase and a high level algorithm of the method

At this point, it is important to note that in the RL environment we implement the logic required to determine whether a target has been encountered, based on incremental changes in the signal levels. This is achieved via the implementation of a so-called smart buffer that manages a sliding window over the cross-correlation values obtained during the movement of the GPR. As illustrated in Fig. 8, while those values are accumulated, the buffer manager checks whether a proper hill is formed in the window (and thus a target is detected), using the signal values above a certain threshold, at a particular time. If so, then this is reflected in the aggregate reward value by including an additional positive number as a prize for target detection. The window size can be configured and acts as a hyper-parameter to determine the legitimate extent (or stretch) of the hill-shaped signal signature.
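
One possible realization of this buffer logic is sketched below; the window size, the threshold and the exact hill test are illustrative assumptions rather than the precise criteria used in our implementation.

```python
from collections import deque

class SmartBuffer:
    """Sliding window over cross-correlation values that flags a target
    when a hill-shaped signature above a noise threshold has formed."""

    def __init__(self, window_size=13, threshold=0.05):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def push(self, xcorr):
        """Add the latest cross-correlation value; return True on detection."""
        self.window.append(xcorr)
        return self._hill_detected()

    def _hill_detected(self):
        # A "hill" here means: the window is full, its peak rises above the
        # noise threshold, and the peak lies strictly inside the window
        # (i.e., the signal has started to decay again).
        if len(self.window) < self.window.maxlen:
            return False
        values = list(self.window)
        peak = max(values)
        if peak < self.threshold:
            return False
        i = values.index(peak)
        return 0 < i < len(values) - 1 and values[-1] < peak
```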

Fig. 8. A detailed illustration of reward value generation and target detection logic

Algorithm 1. Pseudocode for the high-level algorithm of the method

Note that, unless boundary conditions such as successful target detection, getting out of the area or anomalous target detection are encountered, the reward value at a certain time step is equal to the cross-correlation value generated by the signal processing pipeline of the GPR at that time step. This process is applied both during the training and the evaluation (or actual deployment) of the DQN agent. As such, assuming an agent is properly trained and is deployed as an AI model during a target search procedure, a high-level algorithm that describes the switching between the two phases of our method is given in Algorithm 1. Here, the agent starts following a predefined sinusoidal path until the reward value exceeds a noise threshold (line 10). If this happens, then the agent switches to a path that it generates dynamically by selecting the best action via the trained DQN, until a terminating condition arises (line 18). Then, upon the encounter of an exit condition, the agent goes back to the predefined sinusoidal path. This switching logic, which ensures that the agent alternates between the two modes, is what effectively facilitates the hybridization of the method.
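
The following sketch mirrors this switching logic in plain Python. The env and agent interfaces (move_to, step, act) are hypothetical placeholders for the corresponding framework components, not actual APIs of our implementation.

```python
def run_mission(env, agent, search_path, noise_threshold):
    """Alternate between the parametric search path and the RL-driven
    detection path, reflecting the switching logic of Algorithm 1."""
    for waypoint in search_path:
        reward, obs = env.move_to(waypoint)          # phase 1: coarse search
        if reward <= noise_threshold:
            continue

        # Phase 2: the GPR signal exceeded the noise threshold, so control
        # is handed to the trained DQN until a terminating condition arises.
        while True:
            action = agent.act(obs)                  # greedy action selection
            reward, obs, done = env.step(action)
            if done:                                 # target found, out of the
                break                                # area, or anomalous re-discovery
        # resume the predefined sinusoidal path at the next waypoint
```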

As a summary of this section, and as an overall remark on the novelty and merits of the RL interfaces delineated here, we would like to articulate that:

  1. Using the direction of the host platform in conjunction with the time-windowed signal pattern generated by the GPR, we effectively specify our observation vector in a high-level feature space. Recalling that the observation vector is the input of the neural network approximating the Q-function (i.e., the DQN modeling the behavior of the agent), this approach leads to a significantly simplified DQN architecture. In similar settings in the relevant literature, the input of the DQN is usually constructed using lower-level data spaces, such as an image representing an orthogonal snapshot of the environment including the agent, or a complex image frame capturing an instantaneous view using an onboard camera (e.g., a sensor on the host platform). This entails an additional set of feature extraction layers in the DQN architecture, which introduces further computational complexity and may lead to an additional processing burden for proper interpretation.

  2. By establishing a direct relationship between the movement direction of the agent and a particular differential change pattern in the reward signal, we ensure that our training process exploits the inherent correlation between the navigational movement of the host platform and the signal processing pipeline of the GPR, in particular during the late target detection phase. This makes the training process more efficient and, more importantly, facilitates easier adoption of the trained models in real systems.

  3. By excluding the location information from the observation vector, we ensure that the correct bearing and orientation toward the target are learned by the agent without any dependency on the target position and/or the relative agent position. This is quite important, since it is undesirable to inject such location dependencies into an optimal RL policy if target locations and approach patterns to those targets are completely unknown at model deployment phase. Many similar RL training schemes for path planning applications introduce dependency on the location of the objects of interest (OOI) and the navigation paths of the platform, via 2D images representing situational maps. For OOIs with known, static locations this may be acceptable, but for unknown target detection, the trained agent must cope with random target positions with arbitrary layouts. In contrast, using our approach, the agent learns a position-agnostic policy, thanks to a distinct training scheme which is explained in detail in Sect. 4.3.

3.3.1 Time complexity of the algorithm

The time complexity of our hybrid path generation algorithm can be derived by first considering the sinusoidal navigation path and then the RL-generated sub-paths. Note that, since we do not address the high-level decomposition of a large area into trapezoidal sub-regions (or cells), we do not include a detailed time complexity analysis of that initial stage. It suffices to say that the Trapezoid Decomposition (TD) method, for instance, one of the most common methods in the literature, is known to have \(O(n \log n)\) time complexity for polygon decomposition, and \(O(\sqrt{n})\) for corner stitching of the convex sub-regions [29].

Going back to our hybrid path, the generation of the initial sinusoidal path is achieved using the simple parametric function given in Eq. 6. That is, for a region of size R(X, Y), a path \(S = \left \{ p_0, p_1,\ldots , p_n \right \}\) consisting of n path points \(p_0(x_0, y_0),\ldots , p_n(x_n, y_n)\) can be generated by inserting each \(x_i\) into Eq. 6. This is a one-time operation of O(n) complexity, where n is determined by the grid resolution of the sub-region.

For the RL generated sub-paths, on the other hand, we need to calculate the cost of selecting an action by the RL agent at each step, starting from the moment the trained agent is activated via a signal threshold, to the moment a target is detected. This process spans the period defined in Algorithm  1 depicted by the loop starting at line  12. From a time complexity point of view this loop determines the number of times the DQN agent is queried for action selection for each sub-path, which in turn determines the number of path points, say on average m, in that particular RL-based sub-path. As indicated in Sect. 4.4, our DQN implementation uses a neural network, in fact a Multi-Layer Perceptron (MLP), with an input layer, an output layer, and 2 hidden layers. The cost of running an inference via a trained MLP is proportional to the matrix multiplications across the dense layers, which is given as \(RM_{Dense} = k \times l\) for a single layer [30], where k is the number of features in the input vector and l represents the number of neurons in the layer. So time complexity of our DQN would be \(O(k_i l_i + k_i l_h + k_h l_h + k_h l_o)\), where \(k_i\) is the size of the input vector, \(l_i\) is the number of neurons in the input layer, \(l_h\) is the number of neurons in the hidden layers, \(k_h\) is the size of the input vector to the second hidden layer and also to the output layer and finally \(l_o\) is the size of the output layer.

However, since, as indicated in Sect. 4.4, our input vector size (16), output vector size (5) and the number of neurons in each of the 2 hidden layers (400) are constant, we can define the computational complexity of a single inference through our DQN as constant, O(C), where C would depend on the hardware resources available to the DQN agent implementation. In fact, in our experiments, this inference, on average, takes 5.17 ms when only the CPU is used, and 1.4 ms when the GPU is utilized, on a machine with an Intel Core i7 CPU and an NVIDIA RTX 3060 GPU. Since an RL-based sub-path is computed for every target to be discovered, the number of buried targets, say p, is also a factor influencing the time complexity, leading to O(mp) for all RL-based sub-paths in a region (recall that m was the average number of times the DQN is queried for action selection for a particular RL-based sub-path).
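
As a rough, hardware-independent illustration of why this per-inference cost is constant (counting only the multiply-accumulate operations of the dense layers and ignoring biases and activation functions, which is an approximation), a single forward pass through the 16-400-400-5 network requires about

$$\begin{aligned} 16 \times 400 + 400 \times 400 + 400 \times 5 = 168{,}400 \end{aligned}$$

multiply-accumulate operations per action selection, a number that does not grow with the area size or the number of targets.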

As such, in total we have a time complexity of \(O(n + m p)\) for each sub-region obtained from the initial high-level decomposition. Note that, we do not include the complexity of the RL training process, since training is done only once, for a single target (or perhaps several representative targets) and it is not repeated for different area sizes or as the number of targets grows. Once the training phase is over, the trained agent is deployed to generate RL-based sub-paths mentioned above for different scenarios.

4 Experiments and discussions

4.1 Overview

To conduct the experiments, we have implemented a software framework in the Python programming language. The framework, depicted in Fig. 9, consists of three main components:

Fig. 9. High level design of our experiment framework

  1. Experiment Manager This component has an extensible architecture that allows easy specification of objects, materials, the burial environment, hyper-parameters for the experiment and the mission specification. This component is used off-line to generate high-fidelity simulations representing B-Scan outputs of a GPR system, as specified in the detailed configurations for the hardware of the GPR system, the burial environment (e.g., soil), and the material and geometry of the buried objects. It achieves this via a wrapper for the GprMax simulator, an open-source package that simulates electromagnetic wave propagation by solving Maxwell's equations in 3D with Yee's algorithm and the Finite-Difference Time-Domain (FDTD) method. GprMax generates very high fidelity simulation results and allows detailed modeling of the GPR components, object materials and geometry, as well as the burial medium.

  2. DQN Agent This component implements the learning agent as discussed in Sect. 2.1. It includes a neural network for approximating the Q-function, and incorporates an optimizer, a trainer, mechanisms for managing the experience replay buffer, and all the other essential components for managing a proper MDP environment. For this part, we utilized ChainerRL [31], a reinforcement learning library that implements various RL algorithms including DQN with Prioritized Experience Replay.

  3. RL Environment One of the main contributions of our framework is the RL environment and its sub-components that handle post-processing of GprMax simulation results (to calculate background subtraction and cross-correlation), reward signal generation, observation (state) update, the buffer manager responsible for target discovery and detection logic, and host platform behavior generation. This component is built from scratch in the Python programming language.

4.2 Use of GPRMax for training data generation

We utilized GprMax to obtain A-Scan and B-Scan data for different types of objects in terms of burial depth, material and geometry. Figure 10a illustrates an example experimental layout containing objects of different sizes and materials, buried at different depths. Figure 10b illustrates, at the top, the original B-Scan image containing three object signatures obtained from the layout given in Fig. 10a. The image in the middle shows the B-Scan after background subtraction. The graph at the bottom displays the normalized cross-correlation values corresponding to that B-Scan. Note that the normalized cross-correlation values clearly indicate the relative strength of the signal returns obtained from the different objects.

Fig. 10. a An experimental layout containing objects of different sizes and materials, buried at different depths, defined using GprMax and visualized using Paraview, and b a B-Scan image depicting signal returns from three objects and corresponding target signatures obtained using the cross-correlation function

In the pre-processing steps of our method, we used 1-D band-pass filtering on the A-Scan signals and constant background removal to reveal the target signature. The background A-Scan signal is calculated from the first ten A-Scan signals by taking the average at each depth level (z-axis). Then, this 1-D signal is subtracted from each column of the GPR data. After that, we apply the cross-correlation function defined in Eq. 15 to reveal a one-dimensional target signature,

$$\begin{aligned} c_k= \sum _n{a_{n+k}} \cdot \overline{v_n} \end{aligned}$$
(15)

where a and v denote the sequences (in our case A-Scan arrays), zero-padded where necessary, and \(\overline{v_n}\) denotes complex conjugation.
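
A simplified version of this pipeline is sketched below using NumPy. The band-pass filtering step is omitted and only the zero-lag correlation of consecutive A-Scans is kept; both are simplifying assumptions made for illustration.

```python
import numpy as np

def target_signature(bscan, n_background=10):
    """Background removal followed by cross-correlation (Eq. 15).

    `bscan` is a 2D array of shape (n_samples, n_ascans), one A-Scan per
    column. Returns one normalized cross-correlation value per pair of
    consecutive A-Scans, i.e., a 1D target signature along the path.
    """
    # Mean of the first ten A-Scans at each depth level (z-axis)
    background = bscan[:, :n_background].mean(axis=1, keepdims=True)
    cleaned = bscan - background                     # background subtraction

    signature = []
    for i in range(1, cleaned.shape[1]):
        a, v = cleaned[:, i], cleaned[:, i - 1]
        c = np.correlate(a, v, mode="valid")[0]      # zero-lag correlation
        signature.append(c)

    sig = np.abs(np.array(signature))
    return sig / (sig.max() + 1e-12)                 # normalized values
```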

4.3 The training procedure

Since our method is based on a parametric search path combined with an RL-based target detection phase, we applied a specific training procedure. Our approach is based on training the agent with a single target and making sure that the target detection logic does not depend on the location of a specific target. In other words, irrespective of the coordinates of the target, only the approach direction and the change in the relative signal strength along that direction are used as the basis for correlating actions to rewards. We achieve this by randomly changing the initial location of the agent at each training episode along an arc that surrounds the target as a semi-circle covering the possible approach directions. As such, by randomly perturbing the bootstrap location of the host platform at each episode, we aim to model various typical approach directions to a target while the platform moves along the search path. This is illustrated on the left of Fig. 11. This training process continues until the agent consistently achieves the maximum reward value, which includes the target discovery reward in addition to the accumulated positive signal levels. As illustrated in the top graph of Fig. 11, this happens at around 260 episodes. The success of the training is depicted by the visual output of the final evaluation step, shown on the right of the same figure. This visual representation includes various paths starting in the form of the predefined sinusoidal pattern and then switching to the learned approach paths optimized for single-target detection. In other words, this is an illustration of the hybrid search path scheme in a limited, single-target setting. Here, the transparent green circles connote the area identified by the agent as the enclosing circle likely to include the target position. The opaque blue circle marks the actual position and the shape of the buried object from an orthogonal top view. Note that all the paths in the figure lead to a successful target discovery.

Fig. 11. An illustration of the RL training procedure

At this point, it is worth highlighting the benefits of adopting this training approach. In particular:

  1. Training the agent using only a single target enables us to reduce the computational complexity of the high fidelity GPR simulator (i.e., GprMax). This complexity can be huge, and at times intractable, if the simulation area is large and targets are abundant. Combining typical soil properties with typical object material properties for one representative scenario, one can produce high fidelity simulation results in a spatially constrained setting efficiently, and later duplicate and distribute the signal signature to a large area to emulate arbitrary sparse layouts for extensive training and evaluation sessions.

  2. By imposing temporal and spatial constraints on the training process, we alleviate the sparse reward problem, which can be a significant predicament for reinforcement learning algorithms. The sparse reward problem emerges when the reward signal fails to provide timely and informative feedback to efficiently direct the agent toward improved behavior. Poorly designed reward functions, the inherent complexity of the exploration space and so-called long-horizon episodes are major causes of this complication. Our approach provides a practical solution to this problem.

4.4 The architecture of the DQN and hyper-parameters

For the experiment scenarios, we used a DQN with input size of 16 (equal to the size of the observation vector), an output size of 5 (equal to the size of the discrete action space) and 2 hidden layers of 400 channels each, to model the agent behavior. The optimization algorithm used for the neural network is ADAM [18]. The minibatch size for sampling the experience replay buffer is 128 and initial start size of the buffer is twice the minibatch size (i.e., 256). We use a decaying \(\epsilon\)-greedy explorer which starts with \(\epsilon =1.0\) and eventually declines down to \(\epsilon =0.1\). The update rates for value and target networks are 1 and 50, respectively. In all of the experiments, we performed the evaluation process every 10 episodes averaging the cumulative rewards obtained from 5 consecutive evaluation runs. During evaluation runs, the agent does not take any exploratory actions; it fully exploits the current learning level of the DQN. During the evaluation, once an object is detected, the platform excludes the enclosing circle for that object from the navigation logic, which effectively results in resetting the reward values of that particular circular region to zero, so that the reward signals from that area do not lead to anomalous re-discovery of the same object.
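
For reference, an agent configured approximately along these lines can be assembled with ChainerRL as sketched below. The discount factor, buffer capacity and epsilon decay horizon are placeholder values not specified above, and the exact construction used in our framework may differ.

```python
import numpy as np
import chainer
import chainerrl

obs_size, n_actions = 16, 5

# 16 -> 400 -> 400 -> 5 fully connected Q-function
q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    obs_size, n_actions, n_hidden_channels=400, n_hidden_layers=2)

optimizer = chainer.optimizers.Adam()
optimizer.setup(q_func)

# Prioritized experience replay; capacity is a placeholder value
replay_buffer = chainerrl.replay_buffer.PrioritizedReplayBuffer(capacity=10 ** 5)

# Decaying epsilon-greedy exploration: 1.0 down to 0.1
explorer = chainerrl.explorers.LinearDecayEpsilonGreedy(
    start_epsilon=1.0, end_epsilon=0.1, decay_steps=10 ** 4,
    random_action_func=lambda: np.random.randint(n_actions))

agent = chainerrl.agents.DQN(
    q_func, optimizer, replay_buffer,
    gamma=0.99,                      # placeholder discount factor
    explorer=explorer,
    replay_start_size=256,           # twice the minibatch size
    minibatch_size=128,
    update_interval=1,               # value network update rate
    target_update_interval=50)       # target network update rate
```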

4.5 Experiment results

To observe the performance of the trained agent, we generated a number of mission areas with varying sizes and random target layouts under certain constraints. The targets are generated in as large a number as the horizontal area dimension (i.e., the x-axis) allows with respect to the maximum range of the GPR. Then, for a given mission area, those targets are randomly drifted along the vertical dimension (i.e., the y-axis). The x-component of their position is also perturbed slightly to introduce further randomization. As such, we have defined 8 different field sizes with increasing numbers of buried targets and generated 10 random target layouts for each of those fields, obtaining a total of 80 different random target layouts.
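
A simplified sketch of this layout generation procedure is given below; the spacing of twice the GPR range, the jitter magnitude and the boundary margins are illustrative assumptions.

```python
import random

def random_layout(X, Y, gpr_range, x_jitter=2.0, margin=5.0):
    """Generate a sparse random target layout for an X-by-Y mission area.

    Targets are spaced along the x-axis according to the GPR range, then
    drifted randomly along the y-axis and slightly perturbed in x.
    """
    targets = []
    x = margin
    while x <= X - margin:
        tx = x + random.uniform(-x_jitter, x_jitter)   # slight x perturbation
        ty = random.uniform(margin, Y - margin)        # random drift along y
        targets.append((tx, ty))
        x += 2 * gpr_range     # as many targets as the x dimension allows
    return targets
```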

Fig. 12. Different target layouts randomly generated for testing

Table 1 Ratio of successfully detected targets to all buried targets, using a search path parameterized with the forward range of the GPR (i.e., \(B = 2 \times R_x\) in Eq. 5)

Table 1 provides the average ratio of successfully detected targets to all buried targets in all of the auto-generated fields, some of which, with varying mission area sizes and target numbers, are visually depicted in Fig. 12. The column titled sz - trgts denotes the length and width (i.e., (X, Y)) of the mission area and the number of targets buried in that area, separated by a dash.

Table 2 Ratio of successfully detected targets to all buried targets, using a search path parameterized with a forward range 10% longer than that of the GPR (i.e., \(B = 2 \times R_x \times 1.1\) in Eq. 5)

The columns titled ly-1, ly-2, etc., indicate 10 different target layouts, each having the same number of targets dispersed randomly using an algorithm which ensures that the dispersion covers the full extent of the area, as explained above. For each layout, 6 different search paths are generated by changing the value of k given in Eq. 4. This is to ensure that different entry points to the area (along y-axis) are accounted for in the experiments. Other key parameters such as B are set to \(B=2 \times R_x\) (\(R_x\) being equal to the range of the GPR) and \(\alpha\) is set to \(\alpha = R_x\). The ratios in each cell are calculated by taking the average of the detection performance for these 6 different paths for a particular layout. Consequently, a total of \(8\times 10\times 6 = 480\) trials are run. Note that almost all of the trials result in successful discovery of all hidden objects. The reason for the relatively low (0.89) average performance in the 23-target field in the case of layout-9 is that one of the paths leads to an unexpected agent behavior when an early target is encountered, leading to a premature ending. This seems to happen due to a relatively peculiar approach angle of the agent to the target in question. Although further RL training might help, there is no guarantee that some other path and target approach angle combination would lead to a similar condition. This is an anticipated outcome of near-optimal solutions such as RL. A more obvious precaution to avoid such behavior is to lower the value of \(R_x\) slightly to increase the granularity of the search path.

An expected observation is that when the initial search path is designed with an \(R_x\) value 10% higher than the GPR range, this extension induces a noticeable degrading effect on the performance of the method. This is illustrated in Table 2. Note that, for most of the trials, the average detection performance falls below 100%, and for 8 layouts it falls below 90% (although it remains above 80%).

Fig. 13. Evaluation results for various different target numbers and layouts

Fig. 14. A top view of the 3D animation plot, illustrating a drone following the hybrid path generated by the algorithm

To complement the numerical results given above, we provide visualizations of recorded traces pertaining to several runs obtained from different layouts, to illustrate how the agent switches from the initial search path to the RL-based target detection path. The results given in Fig. 13 indicate that the GPR can detect targets placed in the mission area in different layouts. Multiple paths in the figure denote different entry points to the region. As stated earlier, the transparent green circles connote the area identified by the agent as the enclosing circle likely to include the target position. The opaque blue circles mark the actual positions and shapes of the buried objects from an orthogonal top view. Note that all of the given paths lead to a successful detection of the targets in the given layouts.

Fig. 15. A close shot view of the drone approach paths to two of the objects. This view illustrates the switching between the parametric sinusoidal path and the RL-based target detection path learned by the agent

An appropriate path among the generated paths illustrated in Fig. 13 can then be fed into the control software of a drone as a set of way-points, to guide the drone in an automated way. We implemented a simple drone control model that can simulate the movement along a given set of path points (commonly known as way-point-guided navigation). The quad-copter simulation is adapted from Bobzwik's work (available from [32]), which is mainly inspired by PX4 [33], an open-source flight controller (i.e., autopilot). Using our software framework, it is possible to feed a selected optimized path to the drone simulator mentioned above, for a certain mission layout. Figure 14 illustrates how this mechanism works, where a selected path is used to animate a quad-copter along the points of the path. This figure also demonstrates how our method copes with extreme cases where the targets are located at the edges and relatively far from the sinusoidal path.

To elaborate further, we present, in Fig. 15, a close shot sequence of the path switching procedure (captured from the animated simulation), between the two phases mentioned earlier. The targets selected for the illustration are marked with the numeric labels 1 and 2 in Fig. 14. Figure 15, then, shows the sequence of drone movements that results in the successful detection of those two targets. Note that, once the target signal level is sufficient to trigger the RL-based approach path, the drone leaves the sinusoidal search path and follows the way-points received from the trained agent to approach the target in such a way that ensures proper detection of the target. It then goes back to the search path to continue the coverage scheme planned earlier.

At this point, we would like to compare our method to similar approaches. However, as noted earlier in Sect. 2, it is not straightforward to find an analogous hybrid method in the literature that combines a parametric path with RL-triggered navigation sub-paths in the way adopted in our approach. Therefore, we adopted the Boustrophedon Movement Pattern (BMP) as the base method for performance comparison, which is the part of the Boustrophedon Cellular Decomposition (BCD) [34] approach that deals with the internal coverage of each cell. BMP is a well-known coverage path planning approach for simple geometric cells and is used as the base method for performance comparisons in many studies. To this end, we ran the experiments using the same layouts given above, and recorded both the target discovery performance and the navigation cost in terms of path-length units. Here, we assume that the GPR platform follows the path generated by BMP strictly, and target detection occurs when a certain point on the GPR scan path falls within the radius of the buried object so that a clear target signature is generated.

Table 3 shows the performance of the BMP-Coverage Path (BMP-CP) when the grid resolution is 20 units, which is a distance comparable to that of the sinusoidal search path adopted for our hybrid method (i.e., it is half of the period of the sine wave along the x-axis, and more specifically it corresponds to \(R_x\) given in Fig. 3). Note that, for the majority of the layouts, the target discovery performance remains below 60%. This is visually depicted in Fig. 16. Furthermore, a visual performance comparison of our RL-based hybrid path with respect to BMP-CP for that grid resolution is given in Fig. 17, where target discovery by BMP-CP is indicated by an enclosing circle filled with a diagonal line pattern, to distinguish it from those of our hybrid path.

Table 3 Ratio of successfully detected targets to all buried targets, using a classical boustrophedon CPP with a grid resolution of 20 units (i.e., comparable to the resolution used for our hybrid CPP)
Fig. 16. A visual representation of boustrophedon CP performance in an area with 6 buried targets, for a grid resolution of 20 units

Fig. 17. A visual comparison of our Hybrid Method and boustrophedon CP in the same area layout as the one given in Fig. 16

Fig. 18. Boustrophedon CP performance in the same area as in Fig. 16, but this time for a grid resolution of 10 units

Table 4 Ratio of successfully detected targets to all buried targets, using a classical boustrophedon CPP with a cell resolution of 10 units (i.e., half of the resolution used for our hybrid CPP)
Table 5 Path length-based cost improvement of our hybrid method over high-resolution and low-resolution boustrophedon CPP

Note that all of the targets are discovered using our method, whereas most of the targets cannot be detected by the BMP using this particular grid resolution. If the grid resolution is doubled, then the target discovery performance of BMP-CP improves considerably, reaching the level of our hybrid method, as depicted in Fig. 18 and indicated in Table 4. However, this is achieved only at the expense of increasing the navigation cost by more than a factor of two. A complete cost comparison of our method with the BMP-CP method in terms of path length, for both the low-resolution and high-resolution grids, is given in Table 5, where the column titled Cost incrs. of HiR BP gives the cost increase of the high-resolution BMP-CP with respect to our method, and the column titled Cost incrs. of LwR BP gives the cost increase of the low-resolution BMP-CP in comparison with our method. Here, the lengths of the paths for all of the cases are calculated using Eq. 7. Note that, even with the low-resolution BMP-CP, where the target discovery performance is very low, BMP-CP is more than \(20\%\) more costly in path length. With the high-resolution grid, where the target discovery performance of BMP-CP reaches a comparable level, BMP-CP is more than twice as costly in path length.

5 Conclusion

In this paper, we have introduced a method that combines a coarse search pattern with a machine learning model trained for object detection using reinforcement learning (RL), to facilitate more efficient exploration of outdoor environments for target detection. Our results indicate that an agent trained via RL to converge toward target signatures of particular characteristics can help in directing the navigation logic of the platforms hosting a GPR, leading to more efficient exploration procedures. Our methodology offers a number of benefits, including:

  1. An efficient RL training approach that addresses the sparse reward problem,

  2. An elegant RL interface design that divorces the optimal policy from any dependence on static layouts (i.e., fixed target locations and the relative position of the GPR host platform), facilitating successful target detection in mission areas with unknown target layouts,

  3. A reward function design based on cross-correlation values that exploits the inherent correlation between the navigational movement of the host platform and the signal processing pipeline of the GPR, in particular during the eventual target detection phase.

The method presented in this paper is applicable to open-field missions where the airborne platform is not subject to obstruction; however, obstacle avoidance techniques can be incorporated. One important observation is that the performance of the method is sensitive to the selection of the parameters related to the forward range of the GPR when designing the initial search path. A direct ramification of this is that our method stands to benefit from advances in GPR technology that increase the forward range.

Another benefit of our approach is that the RL-trained target discovery logic can be embedded into a real GPR system in conjunction with the navigation guidance logic of a host platform to facilitate on-line, real-time target discovery tasks. However, the method as it stands has some limitations and thus requires some improvements for this to happen:

  1. Currently, the target signatures used for RL training are obtained via simulation. Although our simulation framework utilizes a very high fidelity tool, our environment definition lacks accurate clutter definitions and high-resolution material specifications to avoid immense computational costs. It would be more desirable to either run the simulations using detailed soil, material and object specifications, or use real data samples obtained from systematic measurements using real systems. Further work is planned to alleviate this limitation.

  2. In the current method, the action space used during the RL-agent training consists of a set of simple discrete movement steps along several directions. In fact, a kinematic or dynamic model of the host platform can be incorporated into the specification of the action space to make the agent training more faithful to real-world scenarios. This, as above, has some negative ramifications for training cost, but would contribute to the success of the agent deployment into a real system.