1 Introduction

Atomic force microscopes (AFMs) are versatile tools for measuring surface topographies in the micro- and nanometer range. During measurement, an AFM scans a sample surface with a sharp tip attached to the end of a cantilever. Usually, the cantilever probe is used as a zero sensor: the topography is tracked by a controller that utilizes the measured interaction between the probe tip and the sample. Keeping this interaction as constant as possible enables a large measurement range while reducing the influence of nonlinear probe effects and tip wear as well as the risk of damaging the tip or sample through large interaction forces.

Currently, AFMs typically use proportional–integral–derivative (PID) controllers [1], as they are easy to implement and stable when parameterized appropriately. However, they have the disadvantage of comparatively low performance, which results in unnecessarily large control deviations. For topographies with steep slopes, these can become so large that, for example, in tapping mode, the tip temporarily loses contact with the surface completely or is pressed hard against it, risking tip breakage or damage to the sample surface [2]. To reduce this risk, the AFM scan speed is typically reduced. This gives the control system more time to respond to topography changes but increases the measurement time and thus the influence of drift, making the measurement less accurate.

A better solution to this problem is the application of more sophisticated controllers using model-based or model-free optimal control approaches. For AFMs, primarily model-based approaches are used [3,4,5], as they are normally easier to realize. However, as the model inevitably has some inaccuracies, the performance of the respective controllers on a real AFM is not completely optimal. This problem is avoided by model-free approaches, in which the controller is optimized for the real AFM behavior [6].

While all optimal control approaches offer, in principle, great potential for improvement compared with PID control, they also have severe practical disadvantages. Classical optimal control approaches used for AFMs define optimality through the norm of the control deviations, typically the L1 or L2 norm. This notion of optimality is adequate if control deviations toward lower and higher tip–sample distances than targeted are equivalent. With AFMs, however, this is only the case for small control deviations. For large control deviations (e.g., in tapping mode), there is a risk of hard contact and thus damage to the sample or AFM tip if the tip is too close to the surface; a control deviation toward a higher tip–sample distance, and possibly a loss of contact, is therefore preferable. A loss of contact only renders the directly affected data points unusable, and these can be rescanned later at a lower scan speed. The breakage of a tip, in contrast, is often significantly more severe, e.g., for ultrasensitive mechanical [7] or magnetic force [8] measurements, where a tip breakage requires a restart of the measurement with a new tip. Another example is the quantitative geometrical AFM measurement of extreme ultraviolet photomask structures [9], where the tip shape is calibrated for subsequent correction of the tip-dilation effect [10, 11]. For such measurements, a tip breakage also requires a restart of the measurement and, additionally, a time-consuming calibration of the geometry of the new tip. Thus, practically optimal AFM control should not only minimize the control deviations themselves but also treat the avoidance of tip and sample damage as a major concern. A practically optimal control behavior should, in this respect, depend on the type of situation: in an unchallenging situation, in which hard contact is not likely to occur, control deviations should be minimized, whereas in a challenging situation, in which hard contact may occur, the control behavior should become asymmetric to avoid damage to the tip and sample as much as possible. To realize a controller with such a behavior, artificial intelligence (AI) based on deep reinforcement learning is a promising approach.

AI has already been used successfully for various applications in nanometrology, such as the measurement of multilayer semiconductor structures [12], small angle measurements using a second harmonic generation setup [13], and AFM [14]. However, the main focus of these works was on data post-processing and data analysis. AI has also been applied to AFM control, but so far only to select the parameters of a PID controller [15, 16], which limits its performance impact. Thus, herein, we intended to develop an AI that directly calculates the motion commands sent to the AFM scan stage. By doing so, the AI is not limited by the underlying PID behavior and can reach the highest possible performance. To our knowledge, the direct use of AI to control an AFM scan has so far only been attempted in one study [16], but in that work, the AI was inferior to a reference PI controller, likely because its design was not complex enough.

Herein, we present the concept of an AI controller that can show the complex control behavior described above for maximum protection of the AFM tip and sample against damage in challenging situations and that still has the potential to perform better than a classical PID controller in unchallenging scan situations. To achieve the best possible performance, the concept intends to train the AI on a real AFM. This paper is structured as follows. Section 2 presents the AFM model for the development and testing of the AI concept. Section 3 highlights the concept of the AI controller. Section 4 outlines some simulation-based examples to demonstrate the potential performance of the developed concept and AI controller.

2 AFM Model for Developing the AI Controller Concept

The AFM model used for the development and testing of the AI controller concept is based on the behavior of a self-developed low-noise AFM. Its operating principle and main components are schematically shown in Fig. 1.

Fig. 1 Principal setup of a tapping-mode AFM

The first important aspect of the AFM behavior is the interaction of the cantilever and tip with the sample. In tapping-mode AFM, the tip and cantilever are driven to a high-frequency oscillation with an amplitude of a few nanometers and then approach the sample surface. If the tip–sample distance \(d\) is smaller than the free oscillation amplitude \(A_{f}\), which is set to 20 nm in the model, the tip briefly touches the surface at the lower turning point of the oscillation and the oscillation amplitude \(A\) drops approximately to the remaining tip–sample distance \(d\). This interaction is simply modeled using Eq. (1).

$$A = \begin{cases} A_{f} & \text{if } d > A_{f} \\ d & \text{if } 0 \le d \le A_{f} \\ 0 & \text{if } d < 0 \end{cases}$$
(1)

The oscillation of the cantilever is detected via a laser beam reflected from the cantilever onto a photodiode. The oscillation amplitude of the recorded signal is determined using a lock-in amplifier. This measurement process is not modeled explicitly. Instead, only a noise influence of the measurement is added to the amplitude \(A\) determined using Eq. (1). For this, white noise with a standard deviation of 0.1 nm is used.
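
For illustration, the tip–sample interaction of Eq. (1) together with the noise influence described above can be written as a short Python sketch; the free-amplitude and noise values are taken from the model above, while the function and variable names are ours and purely illustrative.

```python
import numpy as np

A_FREE = 20.0     # free oscillation amplitude A_f in nm (model value)
NOISE_STD = 0.1   # standard deviation of the white measurement noise in nm

def measured_amplitude(d, rng):
    """Oscillation amplitude for a tip-sample distance d (nm) per Eq. (1),
    plus white noise standing in for the lock-in amplitude detection."""
    if d > A_FREE:
        amplitude = A_FREE       # free oscillation, no tip-sample contact
    elif d >= 0.0:
        amplitude = d            # amplitude limited to the remaining distance
    else:
        amplitude = 0.0          # tip pressed onto the surface
    return amplitude + rng.normal(0.0, NOISE_STD)

# Example: amplitude at the 10 nm setpoint distance
rng = np.random.default_rng(seed=0)
print(measured_amplitude(10.0, rng))   # approximately 10 nm +- noise
```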

During topography measurement, the sample is laterally scanned below the tip by a piezo scanner and the measured values of \(A\) are transferred to a controller. The controller uses these values to calculate a control signal for the z-axis of the piezo scanner to keep \(A\) and thus also \(d\) as constant as possible at setpoint \(A_{s}\), which is here set to 10 nm.

To model the dynamic response of the piezo scanner z-axis to the controller commands, the differential equation of a damped single-mass oscillator is used. To consider the piezo hysteresis, this model is extended by a classic Bouc–Wen model [17], and a so-called linear creep model [18] is added for the influence of the piezo creep. To account for the closed-loop piezo scanner behavior, a PI controller is modeled. Except for the hysteresis, whose parameters were selected such that approximately 10% nonlinearity can occur, the model was parameterized manually using step responses of the piezo scanner. The evaluation of the piezo scanner model for the simulation of AFM measurements is conducted in time steps of 0.1 ms, corresponding to the AFM sampling frequency of 10 kHz. For this, the Euler method is used with a step size of 0.5 µs. Figure 2 displays the simulated open- and closed-loop step response of the piezo stage for a 100 nm motion command. In open-loop mode, the piezo scanner responds quickly but shows a significant overshoot. In closed-loop operation, there is no undesirable overshoot, but the scanner z-axis reaches full deflection only with a considerable time delay. This poses a challenge for the control of the tip–sample distance during an AFM scan, as a large part of the effect of a motion command transmitted to the piezo scanner becomes effective only in subsequent time steps.
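
For illustration, a strongly simplified sketch of one 0.1 ms simulation step of the closed-loop z-axis is given below. It covers only the damped single-mass oscillator with an internal PI loop, integrated with the Euler method at 0.5 µs steps; the Bouc–Wen hysteresis and the creep model are omitted, and all parameter values are placeholders rather than the values fitted to the measured step responses.

```python
import numpy as np

# Illustrative placeholder parameters (not the fitted model values)
OMEGA0 = 2 * np.pi * 2000.0    # natural frequency of the z-axis in rad/s
ZETA = 0.2                     # damping ratio
KP, KI = 0.5, 2000.0           # gains of the internal PI position loop
DT = 0.5e-6                    # Euler step size (0.5 us)
STEPS_PER_SAMPLE = 200         # 200 * 0.5 us = 0.1 ms AFM sampling interval

def closed_loop_step(z, v, integ, z_cmd):
    """Advance the closed-loop piezo z-axis by one 0.1 ms AFM sample.
    z, v are the current deflection (nm) and velocity (nm/s), integ the
    integrator state of the PI loop, and z_cmd the commanded deflection (nm)."""
    for _ in range(STEPS_PER_SAMPLE):
        err = z_cmd - z                 # position error seen by the PI loop
        integ += err * DT
        u = KP * err + KI * integ       # PI output = target of the oscillator
        a = OMEGA0**2 * (u - z) - 2 * ZETA * OMEGA0 * v
        v += a * DT                     # explicit Euler integration
        z += v * DT
    return z, v, integ

# Example: closed-loop response to a 100 nm command over 5 ms
z = v = integ = 0.0
trace = []
for _ in range(50):
    z, v, integ = closed_loop_step(z, v, integ, 100.0)
    trace.append(z)
```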

Fig. 2 Simulated open- and closed-loop 100 nm step responses of the piezo scanner z-axis

A final aspect that is important for modeling the scanning behavior of the AFM is the sample topography. This topography should be available as a real sample so that it can also be used for real-world training of the AI. At the same time, the topography must have high variability and a wide range of surface gradients so that the AI is generally applicable to any surface after training and not only to special surfaces similar to a specific training topography. A simple way of generating a sufficiently variable topography was found by pressing the tip of a PPP-NCLR cantilever (NANOSENSORS™) with a nominal stiffness of 48 N/m (according to the manufacturer) about 2 µm into a smooth silicon surface. This resulted in an indentation area with a pile-up, which, together with the flat areas of the original surface, provides the variability of surface gradients required to train an AI controller that is applicable to a wide range of different surface topographies. The indent topography was then measured with a new AFM tip. One of the obtained irregular profiles, consisting of 3001 data points with a lateral spacing of 1 nm, is used as the topography for training the AI with the AFM model, as depicted in Fig. 3. As intended, the profile has a good variability of surface gradients and, importantly, a steep descent, which enables the AI to learn the desired asymmetric behavior. The other profiles are used for testing.

Fig. 3 Reference surface topography for training the AI controller

3 Concept of the AI Controller

There are three basic components needed to develop a functional AFM controller. First, a suitable AI algorithm has to be chosen. Second, based on this algorithm, a suitable setup of the AI controller for the control task has to be found. Third, based on this setup, a suitable training strategy has to be developed. The components and details of the concept are presented below.

3.1 AI Algorithm

To select a suitable AI algorithm, the principal functionality of the AI controller first has to be clear. The aim of this work is for the AI to use the measured amplitude values, like a classic controller, to calculate motion commands for the closed-loop piezo scanner z-axis. For this, the AI should use a control strategy that is as close to optimal as possible, which has to be learned beforehand. A suitable algorithm for this purpose is the double deep Q-learning technique (DDQL), which combines Q-learning [19] and deep learning and makes it possible to learn optimal decision-making strategies for complex situations. The basic ideas of this algorithm are explained in more detail here, taking the AI controller as an example.

The starting point of DDQL is Q-learning, where a so-called agent (the controller) executes one of several possible actions \(a\) for an observed control state \(s_{t}\). Here, the actions \(a\) correspond to various discrete motion commands for the piezo scanner z-axis, while the state \(s_{t}\) is characterized by the measured oscillation amplitude. As an action \(a_{t}\) executed at time \(t\) has an effect on the AFM, the AFM transitions into a new state \(s_{t + 1}\) in the next time step.

Which action \(a_{t}\) the agent executes for an observed \(s_{t}\) depends on how good the respective action is for fulfilling the agent's task of keeping the oscillation amplitude constant. This is assessed through the so-called Q-value \(q\left( {s,a} \right)\), which is defined in Eq. (2):

$$q\left( s_{t}, a_{t} \right) = r_{t} + \gamma\, r_{t + 1} + \gamma^{2}\, r_{t + 2} + \ldots$$
(2)

The aim of training the AI controller is to let the AI learn to estimate the \(q\left( {s,a} \right)\) of all actions \(a\) as precisely as possible for a given input state \(s_{t}\). Thus, after executing an \(a_{t}\), the agent receives a reward \(r_{t}\) as direct numerical feedback on how good this \(a_{t}\) is for controlling the AFM. The \(r_{t}\) is calculated from a previously defined reward function based on \(s_{t + 1}\) and potentially other parameters. As \(a_{t}\) influences not only \(s_{t + 1}\) but also future states, the rewards of future time steps also contribute to \(q\left( {s_{t} ,a_{t} } \right)\). However, as this influence decreases over time, future rewards are gradually weighted down by a discount factor \(\gamma\), set to 0.7 here. In principle, Eq. (2) could be used to calculate \(q\left( {s_{t} ,a_{t} } \right)\) from the observed rewards and thus let the AI learn the Q-values; in practice, however, this equation is difficult to apply due to the long sum of elements. As the AI is intended to choose the best possible \(a_{t}\), which should have the highest Q-value for a given \(s_{t}\), Eq. (2) can instead be reformulated to

$$q\left( s_{t}, a_{t} \right) = r_{t} + \gamma \max_{a_{t + 1}} q\left( s_{t + 1}, a_{t + 1} \right)$$
(3)

With this equation, knowledge of the reward \(r_{t}\) received for \(a_{t}\) and of the subsequently reached state \(s_{t + 1}\), together with an estimate of the maximum Q-value for this \(s_{t + 1}\), is sufficient to approximate \(q\left( {s_{t} ,a_{t} } \right)\). For training the AI, the estimate of the maximum Q-value for \(s_{t + 1}\) is provided by the AI itself, and the received \(r_{t}\) can be used to gradually improve the AI's knowledge of the Q-values based on Eq. (3).

In deep Q-learning, an artificial neural network, namely, a deep Q-network (DQN), is used to learn the relationship between the states \(s\) and the Q-values of all possible \(a\). An (artificial) neural network consists of several layers of neurons, where the outputs of the neurons of one layer are connected to the inputs of the neurons of the following layer by weights. In a single neuron, these weighted input values are summed up and added to a bias value. Afterward, a nonlinear activation function is applied to this sum. Initially, the weights and bias values of the DQN, which determine the Q-values calculated in the output layer of the network for a given \(s\) in the input layer, are initialized randomly. To learn the calculation of increasingly realistic Q-values with the DQN, the current state \(s_{t}\) is passed to the DQN and an (assumed) Q-value is calculated for each \(a\). One of these \(a\) (usually the one with the highest Q-value) is selected and passed to the piezo stage. The resulting \(r_{t}\) and \(s_{t + 1}\) are then used to calculate an updated Q-value via Eq. (3) and to adjust the weights and bias values of the DQN using backpropagation [20]. In this way, the DQN gradually improves its ability to estimate the "quality" of each \(a\) for a given \(s\) and thus its ability to control the AFM.

For the Double DQN (DDQN), which is used here for the realization of the AI controller, two structurally identical networks are applied instead of a single one [21]. In the DDQN, the so-called action network is used to select the actions during training, while the so-called target network estimates the Q-values of the subsequent states. While the action network is updated frequently using the backpropagation algorithm, the target network is updated directly, but less frequently, from the action network. The use of two separate networks with different update intervals has the advantage that the systematic overestimation of Q-values for some actions, which is typical in DQN, is avoided [22], improving the later training results of the AI.
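
Following this description, the updated Q-values of Eq. (3) are computed with the target network evaluating the subsequent states; a PyTorch-style sketch could look as follows, where `target_net` maps a batch of states to the Q-values of the 401 actions and the tensor shapes are our assumptions.

```python
import torch

GAMMA = 0.7   # discount factor used here

def updated_q_values(rewards, next_states, target_net):
    """Eq. (3) targets: the (less frequently updated) target network
    provides the maximum Q-value estimate of the subsequent states."""
    with torch.no_grad():
        q_next = target_net(next_states)       # shape (batch, 401)
        max_q_next = q_next.max(dim=1).values  # best action value per state
    return rewards + GAMMA * max_q_next        # shape (batch,)
```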

3.2 Setup of the DDQN as an AFM Controller

To ensure good functionality of the AI controller, the neural network structure used for the DDQN has to be suitable for the control task. At the network input layer, the AI has to receive sufficient information on the control state of the AFM to be able to identify good actions, and at the output layer, suitable actions must be available to choose from. Moreover, the network requires sufficient neurons in the hidden layers between the input and output layers to be able to map the complex relationship between \(s\) and Q-values of the actions \(a\).

For the AI controller, 10 neurons have proven sufficient for the input layer. Five of them each receive one of the last five measured deviations of the oscillation amplitude from the setpoint value. These allow the AI to recognize the control state. The other five input neurons receive the last five motion commands sent to the piezo scanner so that the AI can also estimate the motion state. The motion state of the piezo stage cannot be determined from the amplitude deviation values alone, but it still has a major influence on the effect of the selected actions.

Between the input and output layers, six fully connected hidden layers are used. The first five hidden layers each consist of 60 neurons using the LeakyReLU activation function [23], while the sixth hidden layer consists of 401 neurons without an activation function.

The number of actions available to the AI at its output layer is set to 401, matching the number of neurons in the last hidden layer. Each available action of the AI represents a motion command \(\Delta z\) for the piezo scanner. In total, these cover a motion command range of −30 to +30 nm at 0.15 nm resolution. A special feature of the output layer is that it is a one-dimensional convolutional layer with one input and one output channel and a kernel of size 21 with fixed weights of 1/21 each. This layer functions like an average filter, smoothing the output values of the last hidden layer to calculate the Q-values of the 401 actions at the output layer of the network. Without going further into detail, the advantage of this convolutional output layer is an efficient and thus fast initial training of the AI; it does not markedly improve the final performance.
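
A possible PyTorch realization of this network structure is sketched below. The class name is ours, and the padding of the convolutional output layer is our assumption, chosen so that 401 Q-values are returned; the layer widths, activation, kernel size, and fixed weights follow the description above.

```python
import torch
import torch.nn as nn

class AFMQNet(nn.Module):
    """Sketch of the DDQN network of Sect. 3.2: 10 inputs (5 amplitude
    deviations + 5 past motion commands), five hidden layers of 60
    LeakyReLU neurons, a 401-neuron linear layer without activation, and
    a fixed averaging Conv1d output producing the Q-values of the 401 actions."""

    def __init__(self):
        super().__init__()
        layers, width = [], 10
        for _ in range(5):
            layers += [nn.Linear(width, 60), nn.LeakyReLU()]
            width = 60
        layers.append(nn.Linear(60, 401))       # sixth hidden layer, no activation
        self.body = nn.Sequential(*layers)
        # Averaging output layer: kernel 21, fixed weights of 1/21 each;
        # padding=10 (our assumption) keeps the output length at 401.
        self.smooth = nn.Conv1d(1, 1, kernel_size=21, padding=10, bias=False)
        with torch.no_grad():
            self.smooth.weight.fill_(1.0 / 21.0)
        self.smooth.weight.requires_grad_(False)

    def forward(self, s):                       # s: (batch, 10)
        q = self.body(s).unsqueeze(1)           # (batch, 1, 401)
        return self.smooth(q).squeeze(1)        # (batch, 401) Q-values

net = AFMQNet()
q_values = net(torch.zeros(1, 10))              # -> shape (1, 401)
# Motion command represented by action index i (our reading of the action grid):
# dz = -30.0 + 0.15 * i   (i = 0 -> -30 nm, i = 400 -> +30 nm)
```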

3.3 Training Strategy

To train the AI to estimate realistic Q-values and to explore an optimal control strategy, the AI itself has to perform actions and learn from their consequences, as introduced in Sect. 3.3.1. Section 3.3.2 details the definition and calculation of the rewards based on the observed consequences of performed actions. Section 3.3.3 presents the actual training process of the AI based on the collected data.

3.3.1 Exploring the Consequences of Actions

Before starting to collect experience on how to control the AFM, the weights and bias values of the action and target network have to be initialized. This is performed randomly using a normal distribution with a standard deviation of 0.1. Afterward, the exploration of a control strategy is started. For this purpose, a random 1 µm long segment of the training topography from Fig. 3 is selected for scanning. The tip is then approached to the surface, and the selected segment is scanned at a speed of 10 µm/s. This scan is called a trial. At each sampling step of the AFM, the oscillation amplitude resulting from the current position of the tip above the sample is determined. Then, typically, the action \(a\) with the highest Q-value for the current state \(s_{t}\) is selected; with a probability of 5%, a random action is performed instead to explore the effect of presumably suboptimal actions.
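
This epsilon-greedy action selection can be sketched as follows; the function name is ours, and `action_net` stands for the action network described in Sect. 3.2.

```python
import random
import torch

EPSILON = 0.05   # 5 % random actions during exploration

def select_action(action_net, state):
    """Return the index of the chosen action: usually the one with the
    highest Q-value for the current state, with probability EPSILON a
    random one of the 401 available actions."""
    if random.random() < EPSILON:
        return random.randrange(401)
    with torch.no_grad():
        q = action_net(state.unsqueeze(0))   # state: tensor of shape (10,)
    return int(q.argmax(dim=1).item())
```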

As the AI is not a functioning controller at the start of the training and cannot yet estimate realistic Q-values, and as it can also fail later because it is not a deterministic controller, the AFM can easily get into critical control states during the exploration. To avoid this, a PID controller is used as a backup system, which selects the action if certain criteria are fulfilled. At the beginning of training, the magnitude of the control deviation in the current and previous time steps has to be greater than 6 nm for the PID to step in. This condition allows the PID controller to intervene before a loss of contact or a hard contact is reached but simultaneously leaves room for the AI to explore a good control strategy. After 300 trials, this condition is replaced by a new one that gives the AI further freedom to optimize its control strategy. The new condition features three requirements: First, the control deviation must be at least 9 nm in the current and last time steps. Second, the piezo scanner must currently be moving in the wrong direction along the z-axis, i.e., increasing the tip–sample distance in the case of contact loss or pushing the tip further onto the sample during a hard contact. Third, the AI must have commanded a movement in the wrong direction at the last time step. If, depending on the trial number, the respective PID step-in condition is fulfilled, the control task is transferred to the PID controller for the next 40 time steps.
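
Expressed as code, the step-in logic could look as follows. The sign conventions are our assumptions based on Sects. 3.3.1 and 3.3.2: a positive deviation corresponds to an amplitude above the setpoint (toward contact loss), and a positive motion value corresponds to an approach of tip and sample.

```python
def pid_should_step_in(trial, dev, prev_dev, scanner_motion, prev_ai_dz):
    """Decide whether the backup PID takes over (for the next 40 time steps).

    trial          : index of the current training trial
    dev, prev_dev  : control deviation A - A_s at the current / previous step (nm)
    scanner_motion : current z-motion of the piezo scanner (> 0 = approach)
    prev_ai_dz     : motion command chosen by the AI at the previous step (nm)
    """
    def wrong_direction(deviation, motion):
        # Retracting further during contact loss (deviation > 0) or
        # approaching further during hard contact (deviation < 0).
        return (deviation > 0 and motion < 0) or (deviation < 0 and motion > 0)

    if trial < 300:   # initial, less strict condition
        return abs(dev) > 6.0 and abs(prev_dev) > 6.0
    return (abs(dev) >= 9.0 and abs(prev_dev) >= 9.0
            and wrong_direction(dev, scanner_motion)
            and wrong_direction(prev_dev, prev_ai_dz))
```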

After an action chosen by the AI has been transferred to the piezo scanner, its effect on the AFM is assessed by measuring the oscillation amplitude again in the next time step \(t + 1\) and calculating the reward \(r_{t}\). The reward is not used immediately to train the AI but is stored together with \(s_{t}\), \(a_{t}\), and \(s_{t + 1}\) in a replay buffer with a capacity of 500,000 data pairs. Only after a training scan has been completed is the AI trained with the data from the replay buffer, after which a new training scan is started. This actual training of the AI is detailed in Sect. 3.3.3.
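
The replay buffer itself can be kept very simple; a minimal sketch (class and method names are ours) is given below.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer holding (s_t, a_t, r_t, s_{t+1}) transitions."""

    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)   # oldest entries are dropped first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=2000):
        # Sample at most batch_size transitions for one training pass.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```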

The sequence of scanning the training topography to generate new training data and then performing the actual training of the AI has to be repeated several times to create a functioning controller. Typically, a few hundred repetitions are required to achieve good results, and much longer data collection and training do not improve the AI further. Here, the number of repetitions has been fixed at 2000, which guarantees a sufficiently long training; no further improvement is to be expected when continuing beyond this point.

3.3.2 Definition of the Reward Function to Achieve Asymmetric Control Behavior

The reward \(r\) determines the control behavior that the AI is going to learn. Here, \(r\) is calculated from three components, i.e., \(r = r_{ + } + p_{n} + p_{s}\), each of which has its own purpose. The purpose of the reward component \(r_{ + }\) is to produce a control behavior that keeps the oscillation amplitude as close as possible to the setpoint \(S = A_{s}\) of 10 nm. Therefore, it provides a positive reward between 0 and 1 with a pronounced maximum at the setpoint, depending on how small the control deviation \(\Delta S_{t + 1} = A_{t + 1} - S\) of the oscillation amplitude \(A_{t + 1}\) is. It has been observed that a sharp maximum of \(r_{ + }\) decreases the control deviations when scanning relatively flat topographies; thus, in Eq. (4), the power 8 is used for the reward instead of, for example, the power 2 or 4.

$$r_{ + } = \left( \left( \frac{S - \left| \Delta S_{t + 1} \right|}{S} \right)^{8} + 0.1\, \frac{S - \left| \Delta S_{t + 1} \right|}{S} \right) / 1.1$$
(4)

To realize the desired asymmetric behavior of the control, which minimizes the risk of tip or sample damage in critical situations, two penalty terms \(p_{n}\) and \(p_{s}\) are used. \(p_{n}\) reduces the reward if the tip and sample are closer to each other than intended, i.e., if \(A_{t + 1}\) is smaller than \(S\).

$$p_{n} = -2\, \frac{\Delta S_{t + 1}}{S} \quad \text{if}\; \Delta S_{t + 1} \le 0$$
(5)

Combined with \(r_{ + }\), \(p_{n}\) provides a reward as a function of \(\Delta S_{t + 1}\), as shown in Fig. 4. This configuration of the reward already leads to an asymmetrical control behavior of the AI that attempts to avoid hard contacts. However, it has been observed that the piezo scanner z-axis can build up momentum when following descending slopes of a sample geometry, which prevents it from stopping quickly enough at the end of the slope to avoid hard contact. To prevent damage in such a situation, the penalty term \(p_{s}\) is used to limit the speed of the piezo scanner to a safe level during extended descents. If the last four motion commands \(\Delta z\) (each within the ±30 nm range) are all greater than 0 nm, which corresponds to an approach motion of the tip and the sample surface, and if the sum of these \(\Delta z\) is greater than 20 nm, \(p_{s}\) is calculated using Eq. (6). Depending on the motion pattern of the piezo stage, the penalty \(p_{s}\) then lies between −1 and −2; it is 0 if the penalty condition is not met.

$$p_{s} = -\left( 1 + \frac{\Delta z_{t} + 0.5\, \Delta z_{t - 1} + 0.25\, \Delta z_{t - 2}}{52.5} \right)$$
(6)
Fig. 4 Obtainable reward as a function of the control deviation
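
Putting Eqs. (4)–(6) together, the complete reward can be computed as sketched below; the constants are the values stated above, while the function signature and argument names are ours.

```python
SETPOINT = 10.0   # amplitude setpoint S in nm

def reward(amp_next, dz_history):
    """Reward r = r_plus + p_n + p_s according to Eqs. (4)-(6).
    amp_next   : measured amplitude A_{t+1} in nm
    dz_history : last four motion commands [dz_{t-3}, dz_{t-2}, dz_{t-1}, dz_t] in nm
    """
    dev = amp_next - SETPOINT                   # control deviation Delta S_{t+1}

    # Eq. (4): positive reward with a sharp maximum at the setpoint
    x = (SETPOINT - abs(dev)) / SETPOINT
    r_plus = (x ** 8 + 0.1 * x) / 1.1

    # Eq. (5): penalty when the tip is closer to the sample than intended
    p_n = -2.0 * dev / SETPOINT if dev <= 0 else 0.0

    # Eq. (6): penalty limiting the approach speed on extended descents
    p_s = 0.0
    if all(dz > 0 for dz in dz_history) and sum(dz_history) > 20.0:
        dz_t, dz_t1, dz_t2 = dz_history[-1], dz_history[-2], dz_history[-3]
        p_s = -(1.0 + (dz_t + 0.5 * dz_t1 + 0.25 * dz_t2) / 52.5)

    return r_plus + p_n + p_s
```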

3.3.3 Actual Training of the AI

After a training scan has been finished and the AFM tip has been withdrawn from the surface, the actual training is started. For this, a batch of up to 2000 data pairs, if available, is sampled from the replay buffer. Using these data pairs, updated Q-values are calculated using Eq. (3). Then, the backpropagation method is applied with the so-called ADAM optimizer [24] and a learning rate of 0.002 to adjust the weights and biases of the action network toward the calculation of the updated Q-values. This process is repeated 20 times, followed by an update of the target network. Then, a new scan for the further collection of training data is started.
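
A sketch of this training pass is given below, reusing the ReplayBuffer and network sketches from above. The batch size, the ADAM optimizer with a learning rate of 0.002, the 20 repetitions, and the subsequent target-network update follow the description; the mean-squared-error loss and the hard copy of the weights into the target network are our assumptions.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.7

def train_after_scan(action_net, target_net, replay_buffer, optimizer):
    """Training pass after a completed trial: 20 backpropagation updates of
    the action network toward Eq. (3) targets, then a target-network update."""
    for _ in range(20):
        batch = replay_buffer.sample(2000)               # up to 2000 data pairs
        states = torch.stack([b[0] for b in batch])
        actions = torch.tensor([b[1] for b in batch])
        rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
        next_states = torch.stack([b[3] for b in batch])

        with torch.no_grad():                            # Eq. (3) targets
            targets = rewards + GAMMA * target_net(next_states).max(dim=1).values

        q_pred = action_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_pred, targets)               # loss choice is our assumption

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    target_net.load_state_dict(action_net.state_dict())  # update the target network

# optimizer as specified in Sect. 3.3.3:
# optimizer = torch.optim.Adam(action_net.parameters(), lr=0.002)
```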

In addition to the classical training procedure of the AI, another training procedure is implemented here, namely, smoothed learning. The fundamental idea of smoothed learning is as follows: Because similar actions of the AI have similar effects on the AFM, similar actions also have similar Q-values. Therefore, if the Q-values calculated by the network for a given state \(s\) are arranged over the corresponding motion commands \(\Delta z\), the resulting Q-curve should be continuous. However, as Fig. 5 shows, this is not yet the case after 200 training trials, where the Q-curves are superimposed with noise. As a result, the AI does not select optimal actions, and the training of the AI becomes suboptimal, as the calculation of updated Q-values using Eq. (3) is biased by the noise, too. To solve this problem, smoothed learning is implemented, in which a batch of states \(s\) observed during training is sampled from the replay buffer and the respective Q-curves are calculated by the target network. These are then smoothed with an average filter of width 31, and the Q-values obtained are used to adjust the action network weights and bias values, again using the backpropagation method. This procedure performs a knowledge transfer between similar actions of the AI, which reduces the noise in the Q-curves and thus increases the performance and reliability of the AI controller.

Fig. 5 Example of a typical Q-curve after 200 trials (blue) and the corresponding smoothed Q-curve (orange) used for smoothed learning

The practical application of smoothed learning is only advisable once the AI has learned to calculate Q-curves that are principally reasonable. In this study, smoothed learning is therefore only deployed after 200 trials. The same parameters are used as in classic training; however, instead of 20 repetitions after each trial, smoothed learning is applied only every 50 trials up to trial 500 and every 500th trial thereafter. When smoothed learning is applied, 500 repetitions are used to achieve a good knowledge transfer before the target network is updated.
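
Smoothed learning can then be sketched as follows; the filter width of 31 and the 500 repetitions follow the description above, while the batch size, the mean-squared-error loss, and the target-network update at the end are our assumptions.

```python
import torch
import torch.nn.functional as F

def smoothed_learning(action_net, target_net, replay_buffer, optimizer,
                      repetitions=500, filter_width=31):
    """Knowledge transfer between similar actions: fit the action network to
    the Q-curves of the target network smoothed by a moving-average filter."""
    kernel = torch.full((1, 1, filter_width), 1.0 / filter_width)
    for _ in range(repetitions):
        batch = replay_buffer.sample(2000)             # batch size is our assumption
        states = torch.stack([b[0] for b in batch])

        with torch.no_grad():
            q_curves = target_net(states).unsqueeze(1)            # (batch, 1, 401)
            q_smooth = F.conv1d(q_curves, kernel,
                                padding=filter_width // 2).squeeze(1)

        loss = F.mse_loss(action_net(states), q_smooth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    target_net.load_state_dict(action_net.state_dict())
```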

4 Results

Based on the concept presented above, a controller with a complex control behavior can be created, which can outperform PID controllers in unchallenging scan situations and shows an asymmetric behavior in challenging situations in which damage to the AFM tip or sample may occur. Before demonstrating this, a typical course of the training is presented.

4.1 Progression of the AI Training

Figure 6 shows the percentage of AI-selected actions during a trial for a typical training run and the curve of the score, which is the sum of all rewards in a trial. In the first trials after the initialization of the AI, both the score and the AI percentage are low. This is because the AI is initially not capable of controlling the AFM, as shown in Fig. 7a for trial 1. Thus, the AFM regularly reaches states outside the intended interaction range, so that the PID control has to step in. However, as can be observed in Fig. 6, a rapid learning process occurs early on, usually showing its first effects around trial 20. As shown in Fig. 7b, around this trial the AI gains the basic ability to control the AFM. In the subsequent learning process, in which the score increases at a steadily decreasing rate up to trial 300, the AI refines its control skills. Interestingly, the AI percentage remains constant in this phase except for some typical fluctuations due to the stochastic nature of the training, indicating that the AI training is limited by the interference of the PID control. From trial 300, where the stricter condition for the step-in of the PID is used, the AI percentage increases to almost 100%, indicating that the AI can already control the AFM well at this point. Moreover, from this trial onward, the score begins to increase again, as the AI now has more freedom to find and learn the best possible control strategy. By around trial 1000, the AI reaches its optimal performance with an almost constant score of around 850 out of a maximum possible 1000.

Fig. 6 Development of the percentage of AI-selected actions (blue, left axis) and score of the trials (orange, right axis) during the training of the AI

Fig. 7 Exemplary tip–sample distance during a) trial 1 and b) trial 20 of the scans on the training topography. The nominal distance is 10 nm. Distances smaller than 0 nm correspond to hard contact, and those larger than 20 nm correspond to contact loss. Depending on the active controller, the distance is shown in blue (AI) or green (PID)

4.2 Performance of the Trained AI Controller

To demonstrate the performance potential of the trained AI controller, it is compared with an optimized PID controller, the current state of the art for AFM control, for several simulated line scans. While the AI from trial 2000 of the training run shown in Fig. 6 is applied for all following experiments, the PID is optimized for each individual scan to provide the best technically possible root-mean-square (RMS) control deviation. First, the AI and PID are used to scan the three topography profiles depicted in Fig. 8, all measured on the same sample as the training profile. Both controllers can scan all three profiles at a speed of 10 µm/s without contact loss or hard contact. For better comparability of the results, no noise has been added to the simulated oscillation amplitudes. The resulting simulated RMS control deviations are given in Table 1.

Fig. 8 Topography profiles used for testing the AI controller

Table 1 RMS control deviation of an individually optimized PID controller compared with the trained AI for three different simulated line scans

For this case of unchallenging topographies, the AI controller results in 4 to 5 times smaller control deviations than the PID, even though the PID is optimized for the individual profiles while the AI is not. This demonstrates the fundamentally high potential of AI-based control for AFM. However, the AI is not only intended to produce small control deviations but also to avoid damage to the AFM tip or the sample due to hard contacts in challenging situations. This capability of the AI is presented in Fig. 9, which shows the simulated scan of a topography with a steep slope. At the beginning of the zoomed-in slope region, both the AI and the PID controller follow the topography. When the slope becomes steeper, the PID controller tries to continue following the topography. Thus, it builds up considerable momentum, which prevents it from avoiding hard contact in the region where the slope of the surface suddenly decreases again, as can be seen from the oscillation amplitude of the cantilever. In contrast, the AI controller reduces its speed in the region of the largest slope, causing the tip to lose contact with the surface for a relatively long distance compared with the PID but allowing the AI to avoid hard contact, and thus the risk of damaging the AFM tip or sample, where the surface slope decreases again.

Fig. 9 Simulated scan of a topography with a steeply declining slope using the optimized PID and the AI controller. In blue, the tracking of the topography (green) by the two controllers is shown. The resulting simulated oscillation amplitude of the cantilever is depicted in red. Two zoomed-in segments show the behavior of both controllers for a relatively flat region and the steep slope of the profile

The topography scan in Fig. 9 clearly highlights that the minimization of a simple norm, e.g., the L2 norm, is not the best objective for AFM control. The AI shows the practically preferable behavior of strictly avoiding hard contacts in challenging situations while keeping control deviations small in unchallenging situations, as can be seen in the zoomed-in flat region in Fig. 9. Nonetheless, owing to the long loss of contact, the overall RMS control deviation of the AI for this scan, 0.83 nm, is larger than that of the PID, 0.74 nm.

5 Summary

In this study, a novel approach to AFM scan control is introduced through the use of an AI controller based on DDQL. The proposed AI controller is trained on simulated AFM scans and showcases the potential for significant improvement over traditional PID controllers. Its key innovation is its ability to adapt the control behavior dynamically, minimizing control deviations in unchallenging scan situations and demonstrating an asymmetric response in challenging situations to avoid potential damage to the AFM tip or sample due to hard contacts. Thus, unlike traditional optimal control approaches, the AI does not solely focus on the minimization of control deviations, as this is only mathematically, and not practically, optimal.

In the training strategy presented here, the AI explores a control strategy by itself, as this, in our experience, yields the best final control performance. However, this approach carries the risk of damaging the AFM tip during the initial trials of the training, when the AI is not yet a functioning controller. To mitigate this risk, a PID controller is used as a backup that steps in in critical situations. As the developed training strategy results in a fast transition from a limited to an acceptable control capability of the AI early in the training, the risk of damaging the AFM tip or sample is further reduced.

Note that while the observed AI control behavior is impressive, the concept and results presented herein are solely based on simulations. Thus, future work involves implementing and testing our AI controller concept on a real AFM system, validating its effectiveness and real-world applicability for AFM.