1 Introduction

Imaging in the terahertz (THz) band is attracting growing interest across various fields and has become an active research area (Afsah-Hejri et al., 2019; Stantchev et al., 2020; Strag & Swiderski, 2023; Valusis et al., 2021; Zachary et al., 2011). Despite this potential, multiple-pixel imaging setups in the THz band remain too costly for most users. A single-pixel imaging (SPI) configuration may offer a cost-effective alternative (Stantchev et al., 2020). Yao et al. (2024) integrate machine learning algorithms with THz SPI techniques to identify linear patterns from a small number of single-pixel values. Their simulations and experiments demonstrated a classification accuracy of 90%. Deng et al. (2023) introduce an approach to terahertz imaging called high-efficiency THz SPI, in which spatial light patterns are generated using physics-enhanced deep learning networks. This method combines a model-driven fine-tuning process with a physics-informed layer, resulting in high-quality imaging. However, compared to conventional imaging techniques, it requires more complex hardware. Although SPI provides an affordable solution, the data collection needed to reconstruct an image is time-consuming because the number of measurements must be at least equal to the number of pixels in the image (Hu et al., 2022; Yang et al., 2020; Zanotto et al., 2020a). Over the years, various methods have been developed to shorten image acquisition time in the THz band. Compressive sensing (CS), in which images can be reconstructed from fewer samples than the Nyquist sampling theorem requires, offers a potentially important solution (Baraniuk, 2007; Candes & Wakin, 2008).

A single-pixel THz imaging setup based on CS was first developed by Chan et al. in 2008. The configuration suggested in Chan et al. (2008a) can reconstruct 4096-pixel THz images from only 500 measurements. Chan et al. (2008b) enhanced this method by employing a set of binary metal masks and reconstructing THz images from 300 measurements. Lu et al. (2020) achieved accurate reconstruction of single-pixel THz spectral images by combining CS with the inverse Fresnel diffraction (IFD) algorithm. In another study, Zanotto et al. (2020b) demonstrated that combining single-pixel imaging with compressive sensing algorithms can reduce the complexity of current THz imaging systems. While the use of CS in the single-pixel imaging configuration has brought significant advances, it still suffers from long collection and reconstruction times due to the large number of measurements, particularly for large images. Our proposed solution to this problem is to use the caustic lens effect induced by perturbations in a ripple tank as a sampling mask. The dynamic nature of the ripple tank generates intricate caustic patterns. These patterns function as a caustic lens mask, introducing randomness into the sampling process. This can significantly reduce measurement time by exploiting the inherent sparsity of THz band signals.

After recording the signals that pass through the caustic lens mask, we observed distinct signal patterns for different targets in our experimental setup. In this work, a Convolutional Neural Network (CNN) has been applied to perform the classification task using features extracted from these distinct signal patterns. CNNs have emerged as a powerful tool for classifying THz data due to their ability to automatically categorize complex spectral information captured by THz imaging systems. Shen et al. (2021) presented a method that combines terahertz spectral imaging with a CNN and found that it can identify impurities in wheat with an accuracy of 97.83%. Kubiczek et al. (2022) proposed using CNNs for automated feature extraction and classification of materials from THz images. Their CNN-based approach achieved 98% accuracy in classifying various materials based on their THz images. Wang et al. (2021) explored the application of CNNs to classifying liquid contraband from THz images for security inspection purposes. When tested on a dataset with seven different concentrations of ethanol solutions, their framework achieved an accuracy of 97.14% despite the low SNR. Although CNNs have largely been used in tasks involving two-dimensional images (Maggiori et al., 2017; Scott et al., 2017), they can be applied to one-dimensional signals such as time series or sequential data with suitable modification (Huang et al., 2019; Jun et al., 1804). Liu et al. (2022) implemented deep learning for breast cancer tissue classification and employed a wavelet synchro-squeezed transformation (WSST) as a preprocessing step for the terahertz data. This transformation converts the time-sequential data from each terahertz pixel into a spectrogram representation. The resulting spectrograms are then used as input tensors for the CNN.
In our study, we employed the Continuous Wavelet Transform (CWT), one of the commonly used methods for converting 1D signals into 2D images suitable for input into a CNN (He et al., 2018; Li et al., 2019; Zhao et al., 2019). What differentiates our research from the studies mentioned above (Huang et al., 2019; Jun et al., 1804; Kubiczek & Balzer, 2022; Liu et al., 2022; Maggiori et al., 2017; Scott et al., 2017; Shen et al., 2021; Wang et al., 2021) is the incorporation of a caustic lens mask in the imaging setup. The results revealed that our classifier achieved 99.22% accuracy in classifying targets in the form of Latin letters. The use of the caustic lens mask in the imaging setup contributed significantly to this high accuracy. Controlled randomness, which results from the dynamic nature of the ripple tank, helps to reduce overfitting, a condition in which a model performs well on the training data but poorly on unseen data.

The remainder of this paper is organized as follows. Our proposed method is explained in Sect. 2. Section 3 provides detailed information about the experimental setup used in this study. In Sect. 4, the proposed Convolutional Neural Network architecture is presented. The results obtained with the proposed method are given in Sect. 5. Section 6 concludes the paper.

2 Proposed method

Researchers have traditionally studied waves with an instrument known as a ripple tank (Kuwabara et al., 1986). A ripple tank produces waves when a vibrating object disturbs the water surface. These waves then propagate outward, generating ripple patterns that can be observed. Observing ripples generated in a ripple tank can offer qualitative insights into how light refracts as it passes through water. The peaks (crests) of waves represent regions where the water surface is elevated. As light interacts with these elevated regions, it undergoes refraction, bending towards the normal. This bending effect is like the converging effect of a convex lens in optics. Conversely, the troughs of waves represent regions where the water surface is lowered. Light interacting with these areas also undergoes refraction but in the opposite direction, bending away from the normal. This is similar to the diverging effect of a concave lens in optics.

In summary, the peaks and troughs of waves in a ripple tank mimic the basic principles of lens behavior. This lensing effect is a result of the refraction of light as it interacts with the varied topography of the water surface created by the ripples in the ripple tank. Instead of using water in this research, we used a ripple tank filled with mineral oil. While water exhibits absorption characteristics in the THz band, the refractive index of mineral oil is more suitable for creating lensing effects at the working frequency.

A caustic lens refers to the optical phenomenon that occurs when light is refracted or reflected by a surface with varying curvature or refractive properties. This phenomenon leads to the formation of concentrated patterns of light intensity known as caustics. The dynamic ripples in the tank act as a caustic lens, influencing the trajectory and characteristics of the THz waves as they traverse the oil surface. CS with random sampling masks finds applications in scenarios where acquiring a full set of measurements is impractical or costly. In this work, the caustic lens effect generated with a mechanical arm is used as a random sampling matrix.

The inherent sparsity of signals in the THz band allows the use of solutions based on Compressive Sensing (Sarieddeen et al., 2021). The proposed CS imaging setup is shown in Fig. 1. This imaging assembly consists of a THz transmitter and receiver structure, a target, a ripple tank with an electronically controlled mechanical arm, a collimating lens and a focusing lens.

Fig. 1 Proposed CS imaging setup, consisting of a THz transmitter and receiver structure, a target, a ripple tank with an electronically controlled mechanical arm, a collimating lens and a focusing lens

The measurement process in CS involves acquiring a reduced set of linear measurements y of the signal x, using the optimization procedure explained in Donoho (2006). Mathematically, this can be summarized as (Chan et al., 2008b):

$$y = Mx$$

where x is the original signal vector of length n, y is the measurement vector of length m with m < n, and M is a random sampling matrix of size \(m\times n\).
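The measurement model above can be illustrated with a minimal sketch. It assumes a random binary matrix as a stand-in for the caustic patterns (the real mask is produced physically by the ripple tank), and the sizes, seed and function names are hypothetical choices made only for illustration.

```python
import random

def random_binary_mask(m, n, seed=0):
    """Return an m x n matrix of 0/1 entries.

    Hypothetical stand-in for the caustic patterns; the real mask is physical.
    """
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]

def measure(M, x):
    """Compute y = M x: one scalar detector reading per mask row."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

n = 64   # signal length (illustrative)
m = 16   # number of measurements, m < n

# Sparse signal: only three non-zero entries.
x = [0.0] * n
x[5], x[20], x[41] = 1.0, 0.5, 2.0

M = random_binary_mask(m, n)
y = measure(M, x)
print(len(y))  # 16 readings instead of 64 samples
```

Each entry of y depends only on the few non-zero entries of x that the corresponding mask row selects, which is why sparsity makes recovery from m < n readings plausible. Recovering x from y requires a separate optimization step that is not sketched here.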

The choice of the random sampling mask has a significant impact on the performance of CS. The non-uniformity introduced by the caustic lens mask enhances the sparsity of the signals reaching the receiver of the setup. Sparsity is a critical concept in CS: it implies that only specific regions of the signal actively contribute to the measurements. This enables CS to focus on the essential, non-zero components of the signal.

After acquiring signals from the receiver of the setup, they are converted into scalograms, using the Continuous Wavelet Transform. RGB image generation from scalograms is achieved by mapping the CWT coefficients onto color channels. The resulting images are then fed into a CNN for classification of different shaped targets.

The Continuous Wavelet Transform (CWT) is used for a multi-resolution analysis of the signal. The CWT can capture both frequency and time-domain information, making it a suitable technique for converting signals into images (Rhif et al., 2019). This approach is commonly used in signal processing applications where one-dimensional signals need to be analyzed using image processing techniques.

The CWT computes the inner product between the input signal and the wavelets that are scaled and translated to different positions and scales (Mallat, 1999). This process creates a two-dimensional representation called a scalogram, which shows the strength (magnitude) of the signal at different time–frequency locations. The following equation describes CWT:

$$W\left(a,b\right)=\frac{1}{\sqrt{a}}\int_{-\infty }^{\infty }\psi \left(\frac{t-b}{a}\right)x\left(t\right)\,dt$$

where W is the wavelet transform of the input signal, a is the scaling factor, b is the time shift factor, ψ is the mother wavelet function and x(t) is the input signal.

When the wavelet is contracted (a smaller than 1), it offers high temporal resolution; when the wavelet is dilated (a bigger than 1), it offers high spectral resolution. The first case is ideal for capturing transient events, while the second is ideal for identifying frequencies in steady state.
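A minimal sketch of the CWT magnitude computation is given below. It evaluates the integral above as a direct sum, and it assumes a complex Morlet mother wavelet purely for illustration; the wavelet actually used in the study is listed in Table 1. The complex conjugate of the wavelet is taken, which is the usual convention for complex wavelets. The test signal, the scale list and the function names are hypothetical.

```python
import cmath
import math

def morlet(u, w0=6.0):
    """Complex Morlet mother wavelet (illustrative choice, not from the paper)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * u) * math.exp(-0.5 * u * u)

def cwt_magnitude(x, scales, dt=1.0, support=5.0):
    """Return |W(a, b)| as rows (one per scale a) by columns (one per shift b)."""
    n = len(x)
    scalogram = []
    for a in scales:
        row = []
        for k in range(n):                     # time shift b = k * dt
            acc = 0j
            for i in range(n):
                u = (i - k) * dt / a
                if abs(u) <= support:          # wavelet is negligible beyond this
                    acc += x[i] * morlet(u).conjugate()
            row.append(abs(acc) * dt / math.sqrt(a))
        scalogram.append(row)
    return scalogram

# Steady sinusoid with a period of 16 samples.
signal = [math.cos(2 * math.pi * i / 16) for i in range(128)]
scales = [4, 8, 16, 32]
scalogram = cwt_magnitude(signal, scales)

# Find the scale with the strongest response in the middle of the record.
centre = len(signal) // 2
strongest = max(range(len(scales)), key=lambda r: scalogram[r][centre])
print(scales[strongest])
```

For a steady tone, the response concentrates in the row whose scale matches the tone's period, which shows up as a horizontal band in the scalogram.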

A convolutional neural network (CNN) is a specialized type of deep learning architecture designed for analyzing visual data (Goodfellow et al., 2016). Inspired by the human visual system, a CNN is proficient at automatically learning hierarchical features and patterns from input images. A traditional CNN architecture is composed of an input layer, an output layer and multiple hidden layers.

The first layer, referred to as the input layer, receives the pixel values of the images. After the input layer, the convolutional layer extracts local features from the data using sliding filters. The acquired features are then passed through batch normalization layers, normalizing activations to improve training stability and accelerate convergence. Activation layers, such as ReLU (Rectified Linear Unit), bring non-linearity to the network, enabling the learning of complex patterns and discriminative representations. Additionally, pooling layers are incorporated to downsample the feature maps, reducing the spatial dimensions while preserving the most significant information. The pooling operation, often implementing max pooling, identifies the maximum value within a predefined neighborhood, effectively capturing the most significant features.

The final layers of the CNN consist of fully connected layers and the output layer. The fully connected layers integrate the acquired features from previous layers, gradually shaping them into a compact representation. The output layer, typically a softmax layer, produces class probabilities based on the acquired features.
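As a small illustration of the output stage, the sketch below converts raw class scores into probabilities with a softmax. The five scores are hypothetical values, not outputs of the network described in this paper.

```python
import math

def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical fully connected layer outputs for five classes.
scores = [2.0, 0.5, -1.0, 0.1, 3.2]
probabilities = softmax(scores)
predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted)  # index of the largest score
```

The predicted class is simply the index of the largest probability, which coincides with the index of the largest raw score.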

3 Experimental setup

The CS transceiver structure of our experimental setup is shown in Fig. 2.

Fig. 2 Compressed sensing transceiver structure

On the transmitter side (Fig. 3), the dielectric resonator oscillator (DRO) is used to generate the local oscillator (LO) signal in the 60 GHz upconverter structure. A LO signal with a frequency of 8.3 GHz is obtained at the DRO output.

Fig. 3 Compressed sensing experimental setup: transmitter

The LO signal at the DRO output is connected to the Miteq DM0408HW2 mixer input, which operates at frequencies between 4 and 8 GHz and has a conversion loss of 5 dB. Here, the LO signal is multiplied by the signal from the signal generator. There is a 14 dB attenuator at the signal generator output. The attenuator is used to prevent the generated signal from damaging the mixer.

The mixer output is connected to the input of the Picosecond 5840B broadband amplifier, which operates between 80 kHz and 13.5 GHz with a gain of 21 dB and a noise figure of 5.8 dB. The frequency of the amplifier output signal is then moved to the 24.9 GHz band using the Pacific Millimeter MRF01456 frequency tripler, which has a frequency multiplication factor of 3.

The signal obtained at 24.9 GHz is amplified using a Microwave Power Solid State amplifier operating at frequencies between 27 and 32 GHz with an output power (Psat) of 32 dBm.

The sub-harmonic mixer provides efficient RF to IF or IF to RF conversion using an LO at 1/2 the normal frequency. Thus, the 24.9 GHz signal obtained is upconverted into a 49.8 GHz mmWave signal using a sub-harmonic mixer operating in the 50–75 GHz RF frequency range.

Finally, the sub-harmonic mixer output is connected to the WR15 horn antenna, which operates in the frequency range 58 GHz–68 GHz and has a gain of 15 dBi.

The signal sent from the transmitting part passes through the collimating lens to increase its strength. A ripple tank filled with transformer mineral oil (Fig. 4) is positioned between the collimating and focusing lens. The ripple tank is filled with transformer mineral oil instead of water because water is absorptive at THz frequencies whereas oil is an almost perfect transmission medium at our working frequencies of 50–70 GHz.

Fig. 4 Ripple tank filled with transformer mineral oil

The signal passing through the caustic lens is incident on the object to be imaged. Five distinct letter-shaped targets are used in this experimental setup (Fig. 5). The signal is then received on the receiver side after passing through the focusing lens.

Fig. 5 The five distinct letter-shaped targets used in this experimental setup

On the receiver side (Fig. 6), the upconverted signal transmitted in the THz band is delivered to the downconverter by a horn antenna operating in the frequency range 58 GHz to 68 GHz. The antenna is connected to the RF port of the sub-harmonic mixer, which converts the RF signal to an IF signal. A microwave oscillator with a frequency of 8.218 GHz is used to obtain the receiving LO signal. The oscillator output is moved to the 24.654 GHz band using the Pacific Millimeter MRF01458 frequency tripler, which has a frequency multiplication factor of 3. Using this LO signal generated in the receiver, the 49.8 GHz transmitted RF signal is downconverted.
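The frequency plan described above can be checked with simple arithmetic. The sketch below works in MHz to keep the numbers exact. It assumes that each sub-harmonic mixer effectively mixes with twice its LO input, as suggested by the "LO at 1/2 the normal frequency" description, and it ignores the signal generator offset because its frequency is not stated in the text. The resulting IF value is therefore an illustrative estimate rather than a reported figure.

```python
TRIPLER_FACTOR = 3
SUBHARMONIC_FACTOR = 2   # assumption: mixer acts on twice its LO input

# Transmitter chain (values from the text, in MHz).
tx_dro_mhz = 8300
tx_tripled_mhz = tx_dro_mhz * TRIPLER_FACTOR           # 24.9 GHz band
tx_rf_mhz = tx_tripled_mhz * SUBHARMONIC_FACTOR        # 49.8 GHz transmitted signal

# Receiver chain (values from the text, in MHz).
rx_osc_mhz = 8218
rx_tripled_mhz = rx_osc_mhz * TRIPLER_FACTOR           # 24.654 GHz band
rx_effective_lo_mhz = rx_tripled_mhz * SUBHARMONIC_FACTOR

# Estimated intermediate frequency under the assumptions above.
if_mhz = tx_rf_mhz - rx_effective_lo_mhz

print(tx_tripled_mhz, tx_rf_mhz, rx_tripled_mhz, if_mhz)
```

The first three printed values reproduce the 24.9 GHz, 49.8 GHz and 24.654 GHz figures quoted in the text; the last is the offset between the transmitted carrier and the effective receiver LO.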

Fig. 6 Compressed sensing experimental setup: receiver

Following the acquisition of data from the receiver side, the next stage is to create the scalogram with the colorized CWT technique. The CWT has been applied to the acquired signals using the filter bank approach in the MATLAB Wavelet Toolbox. The input parameters are Signal Length, Wavelet Type, and Voices per Octave; Table 1 shows the values used in this study. The filter bank functions support analytic wavelet families, including the Morse, Gabor and bump wavelets. The scales in the CWT are discretized based on the number of voices per octave. The minimum and maximum scales in the filter bank are chosen automatically based on the energy spread of the wavelet in frequency and time.

Table 1 CWT Filter Bank Parameters

The signal is analyzed using the coefficients produced by the CWT. MATLAB's jet colormap with 128 levels has been used to display the energy at each scale in the scalogram. The colormap is applied to the absolute values of the wavelet coefficients.
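The colorization step can be sketched as follows: coefficient magnitudes are quantized into 128 levels and each level is mapped to an RGB triple. The piecewise-linear formula below is a common approximation of a jet-style colormap and is not MATLAB's exact table; the function names and the small input matrix are hypothetical.

```python
def to_color_index(value, vmin, vmax, levels=128):
    """Quantize a magnitude into an integer colormap index in [0, levels - 1]."""
    if vmax <= vmin:
        return 0
    frac = (value - vmin) / (vmax - vmin)
    return min(levels - 1, max(0, int(frac * levels)))

def jet_like(index, levels=128):
    """Approximate jet-style colour (blue to red) for a colormap index."""
    x = index / (levels - 1)

    def clamp(v):
        return min(1.0, max(0.0, v))

    r = clamp(1.5 - abs(4 * x - 3))
    g = clamp(1.5 - abs(4 * x - 2))
    b = clamp(1.5 - abs(4 * x - 1))
    return (round(r * 255), round(g * 255), round(b * 255))

def colorize(magnitudes, levels=128):
    """Turn a 2D list of |CWT| values into a 2D list of RGB triples."""
    flat = [v for row in magnitudes for v in row]
    vmin, vmax = min(flat), max(flat)
    return [[jet_like(to_color_index(v, vmin, vmax, levels), levels) for v in row]
            for row in magnitudes]

# Hypothetical 2 x 2 block of coefficient magnitudes.
image = colorize([[0.0, 0.5], [1.0, 0.25]])
```

Low magnitudes map to blue tones and high magnitudes to red tones, so the three colour channels of the resulting image encode the energy distribution of the scalogram.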

Our dataset comprises 1280 scalogram images, divided into five distinct classes with a balanced distribution of 256 images per class. These images are stored in designated folders corresponding to their target classes. Each image has a dimension of 32 × 32 pixels.

4 Convolutional neural network architecture

The architecture of the CNN used to classify colorized CWT images is shown in Fig. 7. Table 2 details the configuration of each layer, including their configurable parameters.

Fig. 7 Architecture of the CNN

Table 2 Layers and Configurable Parameters of Our Architecture

The network starts with an image input layer that expects input images of size 32 × 32 with three color channels (RGB). The first convolutional layer employs a 5 × 5 kernel with 8 filters and same padding to retain spatial dimensions. Following the convolutional layer, a batch normalization layer stabilizes the activations for subsequent layers and a rectified linear unit (ReLU) layer introduces non-linearity. Subsequently, a max pooling layer with a 2 × 2 window and a stride of 2 reduces the spatial dimensions while preserving essential features. The second convolutional layer uses a 3 × 3 kernel with 4 filters and same padding, replicating the structure of the first. Batch normalization and ReLU activation follow, ending in another max pooling layer with an identical configuration for further dimensionality reduction.

The latter layers include a fully connected (FC) layer, responsible for weighted summation and activation, followed by a softmax layer for probabilistic output transformation, facilitating multiclass classification. The final classification layer assigns a category to the input image based on the probability scores derived from the softmax layer.
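The layer description above implies the following feature map sizes and parameter counts. This is a rough sketch that assumes a convolution stride of 1 and a fully connected layer with five outputs (one per target class); neither value is stated explicitly in the text, and batch normalization parameters are left out. The helper names are hypothetical.

```python
def conv_same(h, w, c_in, k, filters):
    """Output shape and weight count of a k x k convolution with same padding."""
    params = k * k * c_in * filters + filters   # kernel weights plus one bias per filter
    return (h, w, filters), params

def max_pool(h, w, c, window=2, stride=2):
    """Output shape of a max pooling layer without padding."""
    return ((h - window) // stride + 1, (w - window) // stride + 1, c)

shape = (32, 32, 3)                                    # RGB scalogram input
shape, conv1_params = conv_same(*shape, k=5, filters=8)
shape = max_pool(*shape)
shape, conv2_params = conv_same(*shape, k=3, filters=4)
shape = max_pool(*shape)

features = shape[0] * shape[1] * shape[2]              # flattened input to the FC layer
num_classes = 5                                        # assumption: one output per letter
fc_params = features * num_classes + num_classes

print(shape, features, conv1_params, conv2_params, fc_params)
```

Under these assumptions the network has only a few thousand learnable weights, which is consistent with the short training time reported later in the paper.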

The CNN has been trained using the Stochastic Gradient Descent with Momentum (SGDM) optimizer, which is a common optimization algorithm for deep learning (Goodfellow et al., 2016). This choice was based on the training speed and the simplicity with which the method updates the weights. The number of epochs plays a critical role in determining the model's generalization performance. Figure 8 shows the relationship between the number of epochs and the corresponding loss values. As observed from Fig. 8, the loss curve plateaus after 20 epochs, indicating that the model has converged to a stable state in which the loss function no longer decreases significantly. To prevent overfitting, the maximum number of training epochs is set to 20. The initial learning rate is set to 0.01. The validation data, which is also used to guard against overfitting, is randomly selected from the training data.

Fig. 8 Relationship between the number of epochs and the corresponding loss values

The training of the network used in this study is carried out by the following steps: Firstly, the training data is divided into training and validation sets. Then, the weights and biases of the CNN are initialized. Following initialization, forward propagation is performed to calculate the output of the network for a given input. This includes passing the input through the convolutional layers, applying activation functions and utilizing pooling operations. The loss between the predicted output and the actual labels is computed through a loss function. Following this, backward propagation is done to propagate the loss backward over the network, computing gradients.

The network's weights and biases are iteratively updated based on these computed gradients and the SGDM algorithm, with the main objective of minimizing the loss function. This iterative training procedure is repeated for 20 epochs. Periodic evaluation on a validation dataset is conducted to measure the network's performance, with continuous monitoring of performance metrics to evaluate the model's generalization ability.
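The SGDM weight update can be sketched on a toy problem. The learning rate of 0.01 matches the value given in the text, while the momentum of 0.9 is an assumption, since the paper does not state it. The quadratic loss and the function name are hypothetical and stand in for the network's actual loss.

```python
def sgdm_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One SGDM update: blend the previous step with the new gradient step."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy loss f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
weights, velocity = [0.0], [0.0]
for _ in range(500):
    grads = [2.0 * (weights[0] - 3.0)]
    weights, velocity = sgdm_step(weights, grads, velocity)

print(round(weights[0], 6))
```

The velocity term carries part of the previous update into the current one, which smooths the trajectory and usually speeds up convergence compared to plain gradient descent with the same learning rate.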

The computer used for the training and classification stages has a 2 GHz Intel Core i5 processor, 16 GB of RAM and the 64-bit Windows 10 operating system.

5 Results

In this study, the fivefold cross-validation technique has been employed to evaluate the classification performance. To carry out cross-validation, the dataset is partitioned into five folds. One fold is used for testing the network, while the remaining four folds are used for training it. By repeating this procedure five times, with a different fold held out each time, we obtain an estimate of the model's accuracy that is less prone to bias than one based on a single data split.
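The fold construction can be sketched as below for the 1280-image dataset. This is a plain shuffled split with a hypothetical seed and function name; whether the original study stratified the folds by class is not stated, so no stratification is applied here.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(1280, k=5))
for train, test in splits:
    print(len(train), len(test))   # 1024 training and 256 test images per fold
```

Every image appears in exactly one test fold, so the five test sets together cover the whole dataset once.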

At the end of each iteration, a confusion matrix has been obtained. The confusion matrix provides a numerical comparison between the predicted classes and the real classes of the test set. Five different confusion matrices have been obtained, one for each fold. When presenting the overall classification performance, it is common to calculate the average of the confusion matrices obtained from each fold. This approach provides a more robust estimate of the performance of the classifier and reduces the impact of random variations that may occur in a single fold. Figure 9 illustrates the average confusion matrix obtained for the network.

Fig. 9 Average confusion matrix

To evaluate the classification performance for each label, the accuracy, recall, precision, and F-measure metrics have been calculated from the average confusion matrix and are presented in Table 3.

Table 3 Performance metric results for each label

These metrics provide insights into the model's performance in terms of correctly predicting instances belonging to each label, capturing the overall correctness, the true positive rate, the precision of positive predictions, and a balanced measure of precision and recall, respectively.
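The per-label metrics can be derived from a confusion matrix as sketched below. The 3 × 3 matrix and its labels are hypothetical and serve only to demonstrate the computation; they are not the results of this study. Rows are taken as true classes and columns as predicted classes.

```python
def per_class_metrics(cm, labels):
    """Compute accuracy, recall, precision and F-measure for each label."""
    total = sum(sum(row) for row in cm)
    results = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # true class i, predicted otherwise
        fp = sum(row[i] for row in cm) - tp        # predicted class i, true otherwise
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        results[label] = {
            "accuracy": (tp + tn) / total,
            "recall": recall,
            "precision": precision,
            "f_measure": f_measure,
        }
    return results

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
labels = ["A", "B", "C"]
cm = [[9, 1, 0],
      [0, 10, 0],
      [2, 0, 8]]
metrics = per_class_metrics(cm, labels)
print(metrics["A"])
```

Each label is treated as a one-versus-rest problem, which is why accuracy for a single label also counts the correctly rejected samples of the other classes.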

The use of the average confusion matrix for computation of performance metrics enables a more robust and representative assessment, accounting for variations within individual folds. The calculated accuracy, recall, precision, and F-measure provide valuable insights into the overall performance and effectiveness of our classification.

Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017) is a powerful visualization technique that sheds light on the inner workings of deep learning classifiers. By analyzing gradients flowing into the last convolutional layer of the CNN, Grad-CAM provides valuable insights into how classifiers perceive and interpret visual information. Figure 10a presents a scalogram image, while Fig. 10b displays a corresponding Grad-CAM interpretation. The white regions in the Grad-CAM visualization highlight the areas of the scalogram to which the network attributes the most significance during the classification process. Figure 10b indicates that the network appears to predominantly focus on two horizontal lines within the scalogram. These lines likely correspond to the frequencies that are most prominent in the analyzed signal.
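The core Grad-CAM computation can be sketched in a few lines: each feature map of the last convolutional layer is weighted by the spatial average of its gradient, the weighted maps are summed, and negative values are clipped. The two 2 × 2 feature maps and gradients below are hypothetical values chosen for illustration, and the final normalization to [0, 1] is a display convenience.

```python
def grad_cam(feature_maps, gradients):
    """Return a normalized Grad-CAM heat map from K feature maps and their gradients."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weights: spatial mean of the class-score gradient for each map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    cam = [[max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps)))
            for j in range(width)] for i in range(height)]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Hypothetical activations and gradients for two channels.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 1.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],       # channel 0 supports the class
             [[-1.0, -1.0], [-1.0, -1.0]]]   # channel 1 opposes the class
heat_map = grad_cam(feature_maps, gradients)
print(heat_map)
```

Only locations where class-supporting channels are active survive the clipping, which is what produces the bright regions in a Grad-CAM visualization.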

Fig. 10 (a) A scalogram image and (b) the corresponding Grad-CAM interpretation of the scalogram image

Our CNN classifier achieved an accuracy of 99.22% in classifying I-shaped targets, with approximately 0.78% misclassified as T-shaped targets. It achieved 100% accuracy for H-shaped and O-shaped targets, underscoring its robust performance in these categories. While the classifier had a success rate of 98.83% for T-shaped targets, it occasionally misclassified some as F-shaped. The lowest classification accuracy, 98.05%, was observed for F-shaped targets.
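As a quick consistency check, the per-class rates quoted above can be averaged. Because the dataset is balanced (256 images per class), the mean of the per-class rates equals the overall accuracy. The sketch below uses only the figures given in the text; the variable names are hypothetical.

```python
# Per-class classification rates from the text, in percent.
per_class = {"I": 99.22, "H": 100.0, "O": 100.0, "T": 98.83, "F": 98.05}

# With equal class sizes, the overall accuracy is the plain mean.
overall = sum(per_class.values()) / len(per_class)
print(round(overall, 2))
```

The result agrees with the overall classification accuracy of 99.22% reported for the classifier.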

To assess the effectiveness of our CNN classifier, we conducted comparative experiments with state-of-the-art models [ResNet-18 (He et al., 2016), ShuffleNet (Zhang et al., 2018) and GoogLeNet (Szegedy et al., 2015)]. All convolutional neural network architectures were trained and tested on identical datasets to ensure consistency in evaluation. Figure 11 summarizes the accuracy and the corresponding training time for each model. The results in Fig. 11 show that our network requires less training time than the other state-of-the-art models. This reduced training time translates into computational savings, making our network a more resource-efficient choice.

Fig. 11 Comparison of the proposed method against the state-of-the-art models

6 Conclusion

Our literature review shows that the single-pixel THz imaging configuration still faces challenges related to the time-consuming data collection process required for image reconstruction. This stems from the need to acquire a number of measurements at least equal to the total number of pixels in the final image. CS emerges as a powerful technique for image acquisition because it achieves high-quality image reconstructions with fewer samples than traditional methods require. Despite the remarkable progress achieved by incorporating CS into the SPI configuration, challenges remain regarding extended data collection and reconstruction times. To address this problem, we propose using the caustic patterns produced by perturbations in a ripple tank to create a caustic lens effect in the CS imaging setup. This caustic lens effect introduces randomness into the sampling process, which can significantly decrease measurement time by exploiting the inherent sparsity of signals in the THz band.

Data obtained from the receiver side has been converted into images using the colorized CWT technique. The resulting images have then been fed into a CNN for classification of differently shaped targets. This approach is commonly used in signal processing applications where one-dimensional signals, such as time series data, need to be analyzed using image processing techniques. We achieved an overall classification accuracy of 99.22% for the differently shaped targets. Grad-CAM analysis revealed that the CNN primarily focuses on two distinct horizontal lines in the resulting scalogram image, suggesting that these lines likely represent the most prominent frequency components in the signal. Furthermore, to evaluate the performance of our CNN classifier, we conducted comparative experiments with established state-of-the-art models. Our CNN trains considerably faster than these models, which translates into computational cost savings and makes it a more resource-efficient alternative.