1 Introduction

Industrial, collaborative, and mobile robotics have witnessed tremendous growth and adoption in various sectors, revolutionizing manufacturing processes, automation, and Human-Robot Interaction (HRI). These systems rely on many sensors to perceive and understand the environment accurately. Among the key sensors used, Frequency Modulated Continuous Wave (FMCW) radars, Time-of-Flight (ToF) cameras, and capacitive sensors for tactile and proximity perception (Capacitive Tactile Sensor (CTS) and Capacitive Proximity Sensor (CPS) respectively) have gained significant attention due to their unique capabilities and diverse applications.

FMCW radars are able to detect and track objects in both static and dynamic scenarios. Depending on their configuration, they provide precise range, velocity, and direction-of-arrival measurements, making them ideal for applications such as navigation and localization in automotive and mobile robotics [1, 2], as well as for human tracking and Speed and Separation Monitoring (SSM) [3, 4]. They work in challenging environments where dust, fog, smoke, rain, sparks or other occluding agents might be present [5].

ToF cameras, on the other hand, offer depth perception with higher spatial resolution than radars, enabling accurate 3D mapping and object recognition [6, 7]. However, problems such as occlusion, varying lighting conditions, and various parasitic effects need to be addressed for their optimal utilization. Because of their similar measuring principles and physical properties, ToF cameras can be efficiently paired with radars for sensor fusion applications [8, 9].

Capacitive sensors close the perception gap left by the aforementioned technologies, combining tactile and close-range (0 cm to 20 cm) sensing in a single modality. They can be applied as cost-effective, modular, and flexible on-board or retrofitted large-scale sensing skins [10]. Simple data processing chains allow low measurement latencies and therefore fast reaction times. In contrast to vision-based sensing modalities, capacitive sensors are not affected by substantial occlusions. Whereas CPS are used for close-range human perception to reduce operation speeds ahead of potential collisions [11, 12], CTS can be used for continuous monitoring of contact forces, ensuring compliance with the biomechanical limits given by safety standards and technical specifications (e.g. ISO/TS 15066 for industrial scenarios [13]). Especially in rehabilitation robotics, where a patient's sensation of pain may be significantly reduced, continuous force monitoring is pivotal to avoid injuries.

Accurate digital twins of these sensors would allow speeding up data collection, testing configuration changes, and generally assessing their performance for the selected applications. A proper simulation environment therefore needs to both reproduce the application scenarios faithfully and allow full customization of the sensor parameters. Moreover, it should grant access to a certain level of signal processing. In particular, an important feature is the generation of the time-domain raw radar signal, which is not considered in many sensor simulation environments. A lack of realistic sensor models decreases simulation quality and hinders the subsequent transfer of findings to the real world. Limitations of some state-of-the-art work on the simulation of the aforementioned sensors [14, 15, 16, 17, 18] include low resolution, absence of noise, absence of geometrical features of radar antennas, and high computational complexity. In a similar manner, capacitive sensor simulations (both proximity and tactile) have received attention in recent work [18, 19]. Current limitations include low resolution, lack of multi-touch detection, unavailability of mutual-capacitance modes, and the lack of combined tactile and proximity sensing. Additionally, the inclusion of both conductive and non-conductive occlusions in simulation is limited.

This article describes the implementation of simulation frameworks for realistic sensor signal generation and processing for digital twins of the aforementioned sensors. The simulated data and the capabilities of the proposed simulation environments are compared with real-world sensor measurements and discussed.

2 Time-of-Flight Sensor Simulation

The ToF simulation, as well as the FMCW radar simulation described in Sect. 3, are based on the software Unity 3D [20]. Starting from and improving the preliminary work in [16, 21], the proposed framework provides a user-friendly platform that can be configured to simultaneously simulate the ToF and the radar from a single RGB camera object in Unity.

An overview of the framework components and their interconnections for the joint simulation is shown in Fig. 1.

Fig. 1 An overview of the simulation pipeline to generate realistic ToF and radar data

The Unity 3D developer scene provides interactive 3D content with various objects and materials for modeling different scenarios. The illumination response of the environment to the camera object depends on configurable global illumination and material properties, such as metallic and smoothness. Unity 3D’s High Definition Render Pipeline (HDRP) performs Graphics Processing Unit (GPU)-based rendering using physically based lighting techniques.

A custom shader script retrieves the GPU buffer data and computes the depth and intensity images. Using layers in Unity, it is possible to make some objects visible for the ToF but not for the radar and vice versa. The intensity information is coded into separate color channels of the RGB render of the image, in order to separate the data to be processed by either the ToF or the radar simulation. Afterwards, the images are transferred to the Central Processing Unit (CPU) using a C# script and published as a TCP/IP stream. The C# script additionally controls the settings and the objects' motion in the Unity 3D scene, which is useful for testing velocity estimation with the radar. A Matlab script receives and interprets the intensity and depth images and performs the data processing needed to generate the sensor point clouds.
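
As an illustration of the interface between Unity and the processing client, the sketch below shows a minimal receiving side for such a stream. It assumes a simple length-prefixed framing of raw float32 images; the actual protocol, port and image layout used by the framework may differ.

```python
# Minimal sketch of the client side of the TCP/IP image stream.
# Assumptions (not taken from the framework): each frame is sent as a
# 4-byte little-endian payload length followed by raw float32 pixels of a
# known resolution; host, port and channel layout are hypothetical.
import socket
import struct
import numpy as np

WIDTH, HEIGHT = 1000, 342          # render resolution used in Sect. 2.1
HOST, PORT = "127.0.0.1", 5555     # hypothetical endpoint

def recv_exact(sock, num_bytes):
    """Read exactly num_bytes from the socket."""
    buf = bytearray()
    while len(buf) < num_bytes:
        chunk = sock.recv(num_bytes - len(buf))
        if not chunk:
            raise ConnectionError("stream closed")
        buf.extend(chunk)
    return bytes(buf)

def recv_frame(sock):
    """Receive one frame and return it as a (HEIGHT, WIDTH, channels) float32 array."""
    (payload_len,) = struct.unpack("<I", recv_exact(sock, 4))
    raw = recv_exact(sock, payload_len)
    frame = np.frombuffer(raw, dtype=np.float32)
    return frame.reshape(HEIGHT, WIDTH, -1)   # e.g. depth + intensity channels

if __name__ == "__main__":
    with socket.create_connection((HOST, PORT)) as sock:
        depth_intensity = recv_frame(sock)
        print("received frame with shape", depth_intensity.shape)
```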

2.1 Sensor Placement

The rendered RGB image has a resolution of \(1000\times 342\) pixels and a Field of View (FOV) of \(100^{\circ}\times 45^{\circ}\). The resolution of the camera object is chosen to be four times that of the ToF camera in order to model the Flying Pixel effect, as explained in the next section.

Because the camera object in Unity 3D produces a similar output, the origin of the simulated ToF is placed coincident with the camera origin. Nevertheless, the rotation and translation of any sensor can be retrieved by rearranging the central perspective model equations [22] and using homogeneous transformations. The image is also cut according to the specific FOV of the modeled sensor. This transform-cut approach allows the user to add and place additional sensors without the need for further camera objects. From a single RGB camera in Unity, it is then possible to model various shapes of antenna arrays in the radar simulation, simply by specifying the extrinsic parameters (rotation and translation) of each antenna element.
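
A minimal sketch of this transform-cut idea follows: the rendered depth image is back-projected with the pinhole (central perspective) model, transformed into the pose of an additional sensor, and cropped to that sensor's FOV. All names, poses and FOV values are illustrative assumptions, not those of the framework.

```python
# Sketch of the transform-cut approach: back-project the rendered depth
# image, move the points into the pose of an additional (virtual) sensor via
# a homogeneous transform, and keep only the points inside that sensor's FOV.
import numpy as np

def intrinsics(width, height, fov_h_deg, fov_v_deg):
    """Pinhole intrinsics from image size and field of view."""
    fx = (width / 2) / np.tan(np.radians(fov_h_deg) / 2)
    fy = (height / 2) / np.tan(np.radians(fov_v_deg) / 2)
    return fx, fy, width / 2, height / 2

def backproject(depth, fx, fy, cx, cy):
    """Pixel grid + depth -> homogeneous 3D points in the camera frame."""
    v, u = np.indices(depth.shape)
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)

def cut_to_sensor(points_h, T_sensor_from_cam, fov_h_deg, fov_v_deg):
    """Transform points into the sensor frame and keep those inside its FOV."""
    p = points_h @ T_sensor_from_cam.T
    az = np.degrees(np.arctan2(p[:, 0], p[:, 2]))   # horizontal angle
    el = np.degrees(np.arctan2(p[:, 1], p[:, 2]))   # vertical angle
    keep = (np.abs(az) <= fov_h_deg / 2) & (np.abs(el) <= fov_v_deg / 2) & (p[:, 2] > 0)
    return p[keep, :3]

if __name__ == "__main__":
    depth = np.full((342, 1000), 2.0)               # dummy 2 m depth map
    fx, fy, cx, cy = intrinsics(1000, 342, 100.0, 45.0)
    pts = backproject(depth, fx, fy, cx, cy)
    T = np.eye(4); T[0, 3] = -0.05                  # hypothetical antenna 5 cm beside the camera
    sensor_pts = cut_to_sensor(pts, T, fov_h_deg=60.0, fov_v_deg=45.0)
    print(sensor_pts.shape)
```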

2.2 Signal Modeling

Amplitude Modulated Continuous Wave (AMCW) ToF cameras transmit periodic infrared light signals, which are reflected by the environment. The incoming signal is correlated with the outgoing one to estimate the phase shift caused by the light's travel time and, ultimately, the distance traveled. By modeling the transmitted light as \(g(t)=\cos(\omega_{0}t)\) and the incoming reflection as \(f(t)=A_{T}\cos(\omega_{0}t-\phi_{0})\), where \(\phi_{0}\) and \(A_{T}\) denote the phase shift between the signals and the amplitude of the received signal, the correlation function evaluates to [21]:

$$C=\frac{A_{T}}{2}\cos(\phi_{0}+\omega_{0}\tau).$$
(1)

The values for \(A_{T}\) and \(\phi_{0}\) originate from the intensity and distance maps provided by the shader, where \(\phi_{0}=2\pi d_{ToF}/(\lambda/2)\) and \(\lambda=c/f_{0}\) is the wavelength of the modulation frequency \(f_{0}=\omega_{0}/(2\pi)\). Selecting four observation phases \(\omega_{0}\tau=i\frac{\pi}{2}\) for \(i=0{\ldots}3\), yielding \(C_{0}{\ldots}C_{3}\), gives a suitable estimator for \(\phi_{0}\) and a corresponding distance estimator based on the estimated phase shift:

$$\hat{\phi_{0}}=\mathop{\mathrm{atan}}\left(\frac{C_{3}-C_{1}}{C_{0}-C_{2}}\right),\quad\hat{d_{\mathrm{tof}}}=\frac{c\hat{\phi_{0}}}{2\omega_{0}}.$$
(2)

The distance estimation procedure is performed pixel-wise and expanded to multiple parallel pixels yielding a depth image. Additional effects modeled in the ToF simulation are the following:

  • The Flying Pixel effect, which occurs when a sensor pixel projects onto an area containing depth discontinuities, e.g. a sharp edge. Consequently, the received reflections have different travel times and potentially different intensities. Thanks to the rendered image resolution, which is four times that of the ToF, the summed contribution of four simulated pixels forms a single sensor pixel signal.

  • The Cross-talk effect, due to the tunnel effect in complementary metal-oxide-semiconductor (CMOS) technology. The pixels of ToF cameras are arranged in a grid, so this effect is modeled using a radially symmetric Gaussian filter as an isotropic approximation (see the sketch after this list).
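
As a concrete illustration of the pixel-wise processing, the sketch below evaluates Eq. (1) at the four observation phases, sums 2×2 groups of high-resolution pixels to mimic the Flying Pixel binning, applies a Gaussian filter as the cross-talk approximation, and recovers the depth via Eq. (2). The modulation frequency, the filter width and the use of atan2 instead of atan are illustrative choices, not necessarily those of the framework.

```python
# Minimal per-pixel sketch of the AMCW depth estimation of Eqs. (1)-(2),
# including the 2x2 pixel binning for the Flying Pixel effect and a radially
# symmetric Gaussian filter as cross-talk approximation.
import numpy as np
from scipy.ndimage import gaussian_filter

C_LIGHT = 3e8
F_MOD = 60e6                      # modulation frequency f0 (illustrative value)
OMEGA0 = 2 * np.pi * F_MOD

def correlation_samples(distance, amplitude):
    """Evaluate Eq. (1) at the four observation phases omega0*tau = i*pi/2."""
    phi0 = 2 * np.pi * distance / ((C_LIGHT / F_MOD) / 2)
    phases = np.array([0, np.pi / 2, np.pi, 3 * np.pi / 2])
    return amplitude[..., None] / 2 * np.cos(phi0[..., None] + phases)

def estimate_depth(distance_hr, intensity_hr):
    # Flying Pixel: sum the contributions of 2x2 high-resolution pixels so
    # that four simulated pixels form a single sensor pixel signal.
    c = correlation_samples(distance_hr, intensity_hr)
    c = c.reshape(c.shape[0] // 2, 2, c.shape[1] // 2, 2, 4).sum(axis=(1, 3))
    # Cross-talk: isotropic Gaussian mixing between neighbouring pixels.
    for i in range(4):
        c[..., i] = gaussian_filter(c[..., i], sigma=0.7)
    # Eq. (2), using the quadrant-aware arctan2 for robustness.
    phi_hat = np.arctan2(c[..., 3] - c[..., 1], c[..., 0] - c[..., 2])
    return C_LIGHT * np.mod(phi_hat, 2 * np.pi) / (2 * OMEGA0)

if __name__ == "__main__":
    dist = np.full((342, 1000), 1.5)                 # high-resolution depth map [m]
    inten = np.ones_like(dist)
    print(estimate_depth(dist, inten).shape)         # (171, 500) sensor pixels
```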

2.3 Experimental Results

The commercial sensor used to validate the results in the real world is the PMD CamBoard pico flexx ToF camera. The relevant specifications for the following experiments are summarized in Table 1.

Table 1 Relevant parameters of the used ToF camera. All the quantities are configurable in the simulation

The simulation framework is qualitatively tested in the environment shown in Fig. 2, where multiple objects and materials are present, including a very strong infrared reflector on the right desk. The scene model in Unity 3D reproduces the indoor office environment and contains the most important items. Each object in the simulated environment is assigned material and metallic/smoothness properties. The results in terms of the distance maps of the real and simulated ToFs are shown in Fig. 3. It can be observed that the infrared reflector, roughly at pixel (80, 220), has zero depth in both cases. Due to the high intensity, the affected sensor pixels saturate, leading to a failing depth estimation. Additionally, the Cross-talk effect is present: electrons move from high-intensity regions to lower-intensity ones, producing a visible “cloud” in the image. This effect can be seen in the real-world sensor output as well as in its digital twin in Fig. 3, where the estimated depth in the region of the reflector is null and the neighboring region shows false distance values, i.e., the distance that the original target should have.

Overall, the simulated sensor output closely matches the real-world data. Some differences are due to the objects' properties, e.g. the arm of the chair is hardly visible in the real camera image. Such minor discrepancies can easily be accounted for by additional effort in scene modeling and are not of concern for the proposed concept.

Fig. 2 Real-world and simulated scene. Highlighted in green are the most important targets, including specific infrared and radar reflectors

Fig. 3 A comparison between real-world and simulated ToF camera data shows the importance of the modelled parasitic phenomena

3 Radar Sensor Simulation

The radar transmitter is modeled from the same camera object in Unity 3D by exploiting a different layer and color channel than the ToF. This approach keeps the two sensors synchronized, a feature of paramount importance for sensor fusion applications.

In the proposed simulation, a Uniform Linear Array (ULA) of receiver antennas is modeled with the transformation and cut approach described in Sect. 2. The radar FOV can be set independently from the ToF one.

3.1 Signal Modeling

In FMCW radars, a sinusoidal waveform with varying frequency is transmitted for a time \(T_{c}\). Linear sweeps of the frequency bandwidth \(B\) are known as chirps. The waveform is reflected by the target and captured by one or more receiving antennas after a time delay \(\tau\), proportional to the distance and velocity of the target.

For Doppler velocity estimation, a number \(N_{c}\) of chirps can be transmitted in a single measurement frame. The received signal is mixed with the transmitted one and low-pass filtered to provide the so-called Intermediate Frequency (IF) signal, whose real part is [1, 23]:

$$x_{\mathrm{IF}}(t)=A_{R}\cos(2\pi f_{\mathrm{IF}}t+\phi_{\mathrm{IF}})$$
(3)

where \(A_{R}\) is the signal amplitude, \(f_{\mathrm{IF}}\) is the constant beat frequency corresponding to the difference between the transmitted and received waveforms, and \(\phi_{\mathrm{IF}}\) is the mixed signal phase. Finally, the mixed signal is sampled \(N_{s}\) times at the ADC frequency \(f_{s}\) to provide the raw time-domain data that we are interested in modeling. Radars allowing low-level access can be configured by varying \(N_{s}\), \(N_{c}\), \(f_{s}\), \(B\) and \(T_{c}\). These parameters are chosen on the basis of the desired maximum range, maximum velocity, and resolution. The proposed simulation framework allows the configuration of all these quantities. The generation of the time-domain signal is summarized in the following: the mixed signal is first evaluated at each pixel by discretizing the IF equations for all samples and chirps. For one pixel with coordinates \(\{u,v\}\), we have:

$$\begin{aligned}\tau[n]_{u,v} & =2(R_{u,v}+v_{u,v}n)n_{c}/c\end{aligned}$$
(4)
$$\begin{aligned}f[n]_{u,v} & =2\pi f_{c}\tau[n]_{u,v}+2\pi S\tau[n]_{u,v}n-\pi S\tau[n]_{u,v}^{2}\end{aligned}$$
(5)
$$\begin{aligned}x_{\mathrm{IF}}[n]_{u,v} & =A_{R}\cos(f[n]_{u,v})\end{aligned}$$
(6)

where \(1\leq n_{c}\leq N_{c}\) is the chirp index, \(n=n_{s}/f_{s}\), with \(1\leq n_{s}\leq N_{s}\) the sample index, \(S=B/T_{c}\) is the chirp slope and \(A_{R}\) is the pixel intensity. The pixel's velocity \(v\) is computed from the difference between the radial distance values \(R^{k}\) and \(R^{k-1}\) obtained at two consecutive simulation steps \(k\) and \(k-1\), with \(\Delta t\) the simulation time step. The values of \(R\) are simply given by the distance maps computed in the shader, after the transformation and cutting process. The pixels' contributions are summed up, and everything is repeated for each simulated antenna, obtaining the raw time-domain signal \(x_{\mathrm{IF}}\), reshaped into the so-called \(N_{c}\times N_{s}\times N_{a}\) radar cube, where \(N_{a}\) is the number of Rx antennas.
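
The following condensed Python sketch illustrates this raw-data generation for a single receiving antenna: every pixel contributes a delayed chirp evaluated from the IF phase of Eqs. (4)–(6), the contributions are summed, and white Gaussian noise is added as described in the next paragraph. The radar parameters and the exact chirp indexing (target motion between chirps producing the Doppler phase) are illustrative assumptions, not necessarily identical to the framework's implementation.

```python
# Condensed sketch of the raw IF signal generation for one receive antenna:
# each pixel contributes a delayed chirp, all contributions are summed, and
# white Gaussian noise w[n] is added (x_R[n] = x_IF[n] + w[n]).
import numpy as np

C_LIGHT = 3e8

def radar_cube(R, v, A, f_c=60e9, B=5e9, T_c=66e-6, f_s=4e6, N_s=128, N_c=16,
               noise_std=1e-3, rng=np.random.default_rng(0)):
    """R, v, A: per-pixel radial distance [m], radial velocity [m/s], intensity."""
    S = B / T_c                                    # chirp slope
    n = np.arange(1, N_s + 1) / f_s                # fast-time samples within a chirp
    cube = np.zeros((N_c, N_s))
    for n_c in range(1, N_c + 1):
        # range of each pixel at this chirp (motion between chirps -> Doppler phase)
        R_c = R + v * (n_c - 1) * T_c
        tau = 2 * (R_c[..., None] + v[..., None] * n) / C_LIGHT
        # IF phase as in Eq. (5)
        phase = 2 * np.pi * f_c * tau + 2 * np.pi * S * tau * n - np.pi * S * tau ** 2
        # sum the contributions of all pixels, Eq. (6)
        cube[n_c - 1] = np.sum(A[..., None] * np.cos(phase), axis=tuple(range(R.ndim)))
    return cube + rng.normal(0.0, noise_std, cube.shape)

if __name__ == "__main__":
    # two point targets standing in for two image pixels
    R = np.array([1.0, 2.5]); v = np.array([0.0, 0.5]); A = np.array([1.0, 0.3])
    print(radar_cube(R, v, A).shape)               # (N_c, N_s) for one antenna
```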

Radar tracking performance is highly dependent on the target detection algorithm, which operates on the peaks of the Range-Fast Fourier Transform (FFT) spectrum. In real sensors, low-amplitude peaks corresponding to targets with a low Radar Cross Section (RCS) might not exceed the noise floor and therefore will not be detected. The proposed simulation takes this effect into account. Noise is usually modeled at the receiver level [24] and is a mixture of thermal noise, phase noise and other effects, the sum of which is approximated as additive white Gaussian noise. Having modeled the mixed signal, White Gaussian Noise (WGN) samples are added to obtain the final time-domain radar signal \(x_{R}[n]=x_{\mathrm{IF}}[n]+w[n]\). The variance of \(w[n]\) can be adjusted by the user, as its value varies with each device and in different conditions.

The obtained samples of \(x_{R}\) can be processed with standard radar signal processing algorithms for target tracking. We use a 2D-FFT paired with an OS-CFAR detector and monopulse phase-difference estimation to compute radial distance (range), velocity and Angle of Arrival (AoA), from which the radar point cloud is generated.
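
As an illustration of this processing chain, the sketch below computes a range-Doppler map via a 2D-FFT and runs a simple one-dimensional OS-CFAR detector along the range axis of the zero-velocity row. The window choice, guard/training cell counts and ordered-statistic index are illustrative; the detector used in the framework may be parameterized differently, and the AoA step is omitted here.

```python
# Sketch of the radar processing chain: range/Doppler 2D-FFT on the radar
# cube followed by a simple 1D OS-CFAR detector along range.
import numpy as np

def range_doppler_map(cube):
    """cube: (N_c, N_s) -> magnitude of the 2D-FFT (Doppler x range bins)."""
    win = np.hanning(cube.shape[1])
    range_fft = np.fft.rfft(cube * win, axis=1)      # fast-time -> range
    return np.abs(np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0))

def os_cfar_1d(x, guard=2, train=8, k=12, scale=3.0):
    """Ordered-statistic CFAR: compare each cell against the k-th smallest
    training-cell magnitude, scaled by a constant factor."""
    detections = []
    for i in range(train + guard, len(x) - train - guard):
        left = x[i - guard - train:i - guard]
        right = x[i + guard + 1:i + guard + 1 + train]
        training = np.sort(np.concatenate([left, right]))
        if x[i] > scale * training[k]:
            detections.append(i)
    return detections

if __name__ == "__main__":
    cube = np.random.default_rng(0).normal(size=(16, 128))
    cube[:, :] += np.cos(2 * np.pi * 0.2 * np.arange(128))   # synthetic static beat tone
    rd = range_doppler_map(cube)
    zero_doppler = rd[rd.shape[0] // 2]                      # zero-velocity row
    print("detected range bins:", os_cfar_1d(zero_doppler))
```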

3.2 Experimental Results

The commercial sensor used to validate the results in the real world is the Infineon BGT60TR13C 60 GHz FMCW radar. This small, antenna-on-chip radar is well suited for short-range measurements in indoor and outdoor environments. The relevant specifications for the following experiments are summarized in Table 2.

Table 2 Relevant parameters of the used radar. All the quantities are configurable in the proposed simulation

From experiments in the same office environment used for testing the ToF camera, the radar results are reported in Figs. 4 and 5 in terms of the Range-FFT spectrum and the Range-Angle map, i.e., a 2D point cloud with horizontal AoA. Both figures show the generally similar behavior of the two signals. Some targets are off by a few centimeters, as the simulated environment is not a perfect replica, as mentioned for the ToF case. One advantage of modeling the raw data is evident from Fig. 4, where it can be seen that the real radar generally sees more low-Signal-to-Noise-Ratio (SNR) reflections. These are often discarded by standard detection algorithms and ignored in post-processed data, so information is actually lost.

Another experiment is performed entirely in simulation to test the velocity estimation in a single-target tracking scenario, where the camera object in Unity is programmed to perform a simple back-and-forth motion while tracking the target. The simulated radar is configured to send 16 chirps and the Doppler-FFT is computed. Fig. 6 reports the results with respect to the ground truth (computed directly in Unity). The spikes reflect the presence of noise in the raw time-domain data.
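
For reference, the relation between a Doppler-FFT bin and the corresponding radial velocity for such a 16-chirp frame can be sketched as follows; the carrier frequency and chirp duration are illustrative values, not necessarily those used in the experiment.

```python
# Back-of-the-envelope mapping from Doppler-FFT bins to radial velocities
# for a 16-chirp frame (illustrative radar parameters).
import numpy as np

f_c, T_c, N_c = 60e9, 66e-6, 16
wavelength = 3e8 / f_c
doppler_freqs = np.fft.fftshift(np.fft.fftfreq(N_c, d=T_c))   # Doppler frequencies [Hz]
velocities = doppler_freqs * wavelength / 2                   # radial velocities [m/s]
print("velocity resolution: %.3f m/s" % (velocities[1] - velocities[0]))
print("unambiguous velocity range: +/- %.2f m/s" % (wavelength / (4 * T_c)))
```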

Fig. 4 Spectrum of the Range-FFT for the office scenario. Both signals contain many peaks due to the complexity of the scene. The similar effect of a low-pass filter (for DC leakage component removal) in both signals and the noise floor model in the simulated data are also visible

Fig. 5 Range-Angle map for the office scenario. The simulated data matches the real-world data well, especially in terms of range; errors of a few degrees for some targets are tolerable, given the slightly different placement of objects in the real and simulated worlds

Fig. 6 Velocity plot computed from the Doppler-FFT on the simulated data for single-target tracking

4 Capacitive Sensor Simulation

The capacitive sensor simulation offers two modes: proximity and tactile simulation. Quite similarly to the radar and ToF simulation, the scene for the proximity simulation is represented in a 3D simulation environment such as Unity 3D or CoppeliaSim. The tactile simulation, on the other hand, uses a real-time physics simulator with a finite element solver (such as the Simulation Open Framework Architecture (SOFA) [25]) for a more accurate mesh simulation. An external MATLAB or Python client is used to model the actual capacitive sensors. An overview of the simulation pipeline is shown in Fig. 7.

Fig. 7 An overview of the simulation pipeline for capacitive proximity and tactile simulation

4.1 Theory of Capacitive Sensor Modelling

Fundamentally speaking, capacitive sensors follow the propagation of electromagnetic fields. The electric field \(\mathbf{E}(x,y,z)\) can therefore be expressed using Maxwell's equations,

$$\nabla\times\mathbf{E}(x,y,z)=\mathbf{0}.$$
(7)

Typically, capacitive sensors are operated in a signal range of kHz to low MHz, yielding a wavelength of 10 m to several km and thus allowing a quasi-static simulation of the propagated field. This allows reducing the set of equations essentially to

$$\nabla\cdot\mathbf{E}(x,y,z)=\frac{\rho}{\epsilon},$$
(8)

where \(\epsilon\) is the permittivity of the surroundings and \(\rho\) is the charge density. Following the quasi-static assumption made above, \(\rho\) can be determined by

$$\rho=-\nabla\cdot(\epsilon\nabla\phi(x,y,z)),$$
(9)

where \(\phi\) is the electric scalar potential in 3D space. The equation above is commonly known as the Poisson equation of electrostatics. Actual sensing hardware obtains the electric displacement field \(\mathbf{D}(x,y,z)\), given by,

$$\mathbf{D}(x,y,z)=\epsilon\mathbf{E}(x,y,z),$$
(10)

and measures the electric displacement current. In commercial simulation programs (such as COMSOL), the surface charge \(q\) is determined by taking the closed integral of \(\rho\) over the electrode surface, assuming a planar electrode structure. The capacitance \(C\) is then calculated by

$$C=q/V,$$
(11)

where \(V\) is the excitation voltage of the electrode.
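
To make Eqs. (8)–(11) concrete, the following sketch solves the electrostatic problem on a coarse 2D finite-difference grid with Dirichlet boundary conditions (a driven plate electrode inside a grounded enclosure) and evaluates the capacitance per unit length as \(C=q/V\). Geometry, grid size and iteration count are illustrative toy values; the actual framework uses an FEM formulation rather than this finite-difference toy example.

```python
# Tiny 2D finite-difference illustration of Eqs. (8)-(11): relax the Laplace
# problem with Dirichlet boundaries, then obtain the electrode charge from
# Gauss's law and the capacitance per unit length as C = q / V.
import numpy as np

EPS0 = 8.854e-12
N, V = 81, 1.0

phi = np.zeros((N, N))
electrode = np.zeros((N, N), dtype=bool)
electrode[40, 30:51] = True           # thin plate electrode in the middle
phi[electrode] = V                    # Dirichlet: driven electrode
# the outer boundary stays at 0 V (grounded enclosure)

for _ in range(5000):                 # Jacobi relaxation of the Laplace equation
    new = 0.25 * (np.roll(phi, 1, 0) + np.roll(phi, -1, 0)
                  + np.roll(phi, 1, 1) + np.roll(phi, -1, 1))
    new[electrode] = V                # re-impose the Dirichlet conditions
    new[0, :] = new[-1, :] = new[:, 0] = new[:, -1] = 0.0
    phi = new

# Charge on the electrode via Gauss's law: sum the normal field over the four
# neighbours of every electrode cell (in 2D the grid spacing cancels out of
# epsilon * E * dl, and electrode-electrode pairs contribute zero).
q = 0.0
for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    neighbour = np.roll(phi, (-di, -dj), axis=(0, 1))   # potential of the adjacent cell
    q += EPS0 * np.sum((V - neighbour)[electrode])

print("capacitance per unit length: %.3e F/m" % (q / V))
```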

4.2 Simulation Environment

4.2.1 Proximity Sensing

The capacitive sensor is modelled by an orthographic depth camera in the physics simulation. The field of view of a rectangular electrode is roughly urn-shaped (see [11]); a cuboid-shaped field of view is therefore better suited than a classical conical field-of-view camera. The depth information is then transformed into a point cloud, which feeds the FEM solver used in [18]. The permittivity information of an object is stored in the red channel of the RGB-D data. In this manner, different objects can be assigned different permittivities.

The finite elements are defined as linear tetrahedra. The approach in use [18] avoids a re-meshing procedure (static mesh) between consecutive simulation steps, allowing a significant decrease in computation time at the cost of a minor loss in object outlines. Also, deformations below the size of an individual element are not recognized. The discretization error can be tuned with the right choice of mesh resolution.

In the next step of the approach, the stiffness matrix \(\mathbf{K}^{G}\) is pre-computed according to the operating mode of the capacitive sensor. A capacitive sensor may operate in a mutual-capacitance (differential) or self-capacitance (single-ended) sensing mode. Dirichlet boundary conditions are set to incorporate the transmitting and receiving electrodes. A permittivity vector \(\boldsymbol{\epsilon}\) is then defined to incorporate the material properties from the red channel of the RGB-D information. During initialization, the permittivity of each element is set to \(\epsilon_{i}=1\).

Fig. 8 Approach of the capacitive simulation. Left: the masked mesh (visualized with MATLAB); as only the front surface is captured by the camera, the entire backside is assumed to be of the same material. Right: the representative scene in CoppeliaSim; the blue box shows the bounding box of the orthographic camera, and the RGB image of the camera is shown in the black window

After retrieving an RGB-D image, all mesh elements are masked with respect to the camera image, assigning the accompanying \(\epsilon_{r}\) values in the material vector. A single-camera approach can detect only the surface facing the camera, leaving undefined regions behind that surface, so the object size cannot be fully represented. In [18], the mesh elements at the back side of the object are assigned to the same object. This approach is visualized in Fig. 8. Such a simplification is valid when considering conductive objects (or humans). In each step, \(\mathbf{K}^{G}\) is updated, and variations of the electric field can be obtained by moving objects in the sensor's vicinity. The approach in [18] updates only the residual entries of the matrix to speed up computations.
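
A minimal sketch of this masking step is given below: each element centroid is projected into the orthographic RGB-D image and, if it lies at or behind the observed front surface of an object, it is assigned the relative permittivity stored in the red channel. Function and variable names, as well as the image layout, are hypothetical and only illustrate the idea of [18].

```python
# Illustrative sketch of the masking step: assign per-element permittivities
# from an orthographic RGB-D image, treating everything at or behind the
# observed front surface as belonging to the same object.
import numpy as np

def assign_permittivity(centroids, depth_img, red_img, grid_origin, cell_size,
                        eps_background=1.0):
    """centroids: (M, 3) element centroids; depth_img/red_img: orthographic
    camera output; returns the per-element permittivity vector epsilon."""
    eps = np.full(len(centroids), eps_background)
    # orthographic projection: x/y map directly to pixel indices, z is depth
    cols = ((centroids[:, 0] - grid_origin[0]) / cell_size).astype(int)
    rows = ((centroids[:, 1] - grid_origin[1]) / cell_size).astype(int)
    inside = (rows >= 0) & (rows < depth_img.shape[0]) & \
             (cols >= 0) & (cols < depth_img.shape[1])
    idx = np.flatnonzero(inside)
    r, c = rows[idx], cols[idx]
    occupied = red_img[r, c] > 0                      # the pixel sees an object
    behind = centroids[idx, 2] >= depth_img[r, c]     # centroid at/behind the surface
    hit = idx[occupied & behind]
    eps[hit] = red_img[rows[hit], cols[hit]]          # red channel stores eps_r
    return eps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centroids = rng.uniform(0, 0.2, size=(1000, 3))          # dummy mesh centroids [m]
    depth = np.full((64, 64), 0.1); red = np.zeros((64, 64))
    red[20:40, 20:40] = 80.0                                  # eps_r of a human-like object
    eps = assign_permittivity(centroids, depth, red, grid_origin=(0.0, 0.0),
                              cell_size=0.2 / 64)
    print("elements assigned:", int(np.sum(eps > 1.0)))
```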

4.2.2 Tactile Sensing

The (real-time) simulation of CTS is a highly investigated topic, and we are not yet at the stage of publishing results. This section is therefore more of an outlook, presenting the general concept of the simulation.

A CTS (see Fig. 9) is usually made up of two electrode layers (transmitter and receiver), which are separated by a deformable insulating layer, called the dielectric. When force is applied, the dielectric deforms, and the change in local thickness of the pad changes the capacitance of the nearby channel. Depending on the material of the dielectric, the relative permittivity might also be affected, which is the subject of intensive experimental validation.

Fig. 9 Schematic of a capacitive tactile sensor array with multiple electrodes. The deformable layer acts as dielectric. Each black square represents an electrode

In a similar manner, the deformed dielectric can be modelled in a real-time FEM solver (such as SOFA [25]), and the deformed mesh can then be used to determine the capacitance in a manner quite similar to the proximity sensing approach.
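
As a rough illustration of this concept, the sketch below approximates every taxel as a parallel-plate capacitor whose local dielectric thickness is taken from the deformed mesh. Material constants and geometry are illustrative assumptions, and fringing fields as well as permittivity changes under compression are neglected.

```python
# Back-of-the-envelope sketch of the tactile read-out idea: each electrode
# cell (taxel) is treated as a parallel-plate capacitor, C = eps0*eps_r*A/d,
# with the local dielectric thickness d taken from the deformed mesh.
import numpy as np

EPS0 = 8.854e-12

def taxel_capacitance(thickness_map, taxel_area, eps_r=3.0):
    """thickness_map: (rows, cols) local dielectric thickness [m] per taxel."""
    return EPS0 * eps_r * taxel_area / thickness_map

if __name__ == "__main__":
    rest = np.full((4, 4), 2e-3)                   # 2 mm undeformed dielectric
    pressed = rest.copy(); pressed[1, 1] = 1.5e-3  # local compression under contact
    dC = taxel_capacitance(pressed, taxel_area=(5e-3) ** 2) \
         - taxel_capacitance(rest, taxel_area=(5e-3) ** 2)
    print("capacitance change of the touched taxel: %.2f fF" % (dC[1, 1] * 1e15))
```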

4.3 Experimental Results

This section presents an excerpt of the results of the capacitive proximity simulation and the real-world experiments. The full set of results can be found in [18]. The setup and results can be seen in Fig. 10. A set of stripe-shaped electrodes was placed on the build plate of a 3D printer, and the cylindrical object was mounted on the extruder. Two paths (T1 and T2) were investigated, and the comparison of the simulation and real-world measurements is shown at the bottom of Fig. 10. The capacitive read-out hardware in use was a USRP X310 software-defined radio (SDR) with custom-made receiver daughterboards. The simulation experiments were conducted on an office computer with an Intel i7 processor rated at 3.40 GHz and 16 GB RAM.

Fig. 10 Results of the capacitive sensor simulation. Top left: electrode design. Top right: analysed paths, triangle (T1) and zig-zag (T2). Bottom left: results for path T1. Bottom right: results for path T2

The results show a good match between the simulation and real-world data for the given object and paths. These experiments do not replace simulation-to-real-world comparisons for different sensor designs, object shapes and environmental conditions. The results are nevertheless adequate for the assumptions and simplifications made and are promising for future investigations.

5 Discussion, Conclusion & Outlook

In this work, we have presented state-of-the-art simulation tools for three sensing technologies commonly used in robotic environments: capacitive, radar and Time-of-Flight sensing. These modalities have their respective advantages and drawbacks but complement each other for a more ubiquitous perception of a robot's environment. This work detailed the physics of each sensing principle, described the mathematical modelling and its embedding in a simulation environment, and showcased results for exemplary scenarios. For the given scenarios, the simulations show promising agreement with real-world measurements. The current limitation of these simulation tools is that not all influencing factors (such as occlusions for capacitive sensors) have yet been considered, which calls for future research.

Each component of the simulation environment is modular and can be exchanged with other software or tools, or even with actual hardware for a hybrid approach (combining real hardware and simulation software), assuming common communication protocols.

The necessity of digital twins for the factories of the future is widely accepted. In order to obtain fully functional digital twins for tasks such as condition monitoring or even predictive maintenance, effective and precise sensor simulations are vital.