Transparent Object Reconstruction Based on Compressive Sensing and Super-Resolution Convolutional Neural Network

The detection and reconstruction of transparent objects remain challenging because such objects lack distinctive features and their local appearance varies with illumination. In this paper, compressive sensing (CS) and super-resolution convolutional neural network (SRCNN) techniques are combined to capture transparent objects. With the proposed method, the details of a transparent object are extracted accurately using a single-pixel detector during surface reconstruction. The images obtained from the experimental setup are low in quality owing to speckles and deformations on the object. The implemented SRCNN algorithm overcomes these drawbacks and reconstructs visually plausible images: it locates the deformities in the resultant images and improves the image quality. Additionally, the inclusion of compressive sensing minimizes the number of measurements required for reconstruction, thereby reducing image post-processing and hardware requirements during network training. The results indicate that the visual quality of the reconstructed images increases from a structural similarity index (SSIM) value of 0.2 to 0.53. In this work, we demonstrate the efficiency of the proposed method in imaging and reconstructing transparent objects using a compressive single-pixel imaging technique and in improving the image quality to a satisfactory level with the SRCNN algorithm.


Introduction
Transparent object detection and reconstruction is a field of interest in mechanical and optical engineering, computer vision, and many other areas, owing to the ubiquity of transparent objects in the real world. Diverse approaches such as triangulation [1,2], stereovision [3][4][5], and single-view and multiple-view methods [6][7][8] have been adopted to recover the shapes of glass objects with complex exterior and interior properties. Many of these techniques were originally developed for opaque object reconstruction [9,10] and later applied to transparent objects. Adopting such reconstruction techniques without considering the characteristics of transparent objects has prevented them from providing a complete solution.
Among the developed techniques, some representative methods are reviewed as follows. To extract the surface details of transparent objects, laser heating and thermal imaging were utilized in [11,12]. The thermal camera was first calibrated, and 3-dimensional (3D) coordinates were located to obtain the 3D features of transparent objects. The transparent objects were opaque to the laser light source, and once an object began emitting thermal radiation in a particular direction, the thermal camera observed it. The stereo vision algorithm is a commonly used technique for reconstructing 3D transparent objects, where a time-of-flight (ToF) camera captures images of the target from the left and right views [13]. Normally, a ToF camera is insensitive to transparent objects under ambient light and yields poor results. In contrast, the proposed SR4k (SwissRanger 4000) ToF camera captured good-quality images of transparent objects under infrared (IR) illumination. This approach still produces poor results for objects with occlusions, cluttered scenes, and varied surfaces. In [14], a Fourier single-pixel imaging method was developed for acquiring the 3D shape of translucent objects, and its accuracy was compared with fringe projection profilometry (FPP). Multiple-surface and sub-surface scattering in FPP deformed the fringe patterns, which in turn degraded the recovered 3D shape. These phase deviations were avoided through Fourier single-pixel imaging using four-step phase-shifted sinusoidal fringe patterns, and the results of the work confirmed its accuracy.
Direct-ray measurements were also used to recover the shape of transparent objects, with a sensor recording the transmitted light intensity after it passed through the object. Ray measurements were taken from multiple views by moving the camera around the object to approximate its size or depth from optical flow [1,15]. Later, Kutulakos et al. [1] extended the method to the 3D transparent object reconstruction problem, showing that at least three views were necessary to reconstruct a scene when light rays underwent two refractions. They also proved that the method failed when there were more than two refractions. Soon after, IR cameras were employed to capture transparent scenes by triangulation. Though the results proved reliable, the cost of IR cameras limited their practical applications. Recently, researchers have explored single-pixel imaging, in which imaging is possible even beyond the visible wavelength with an inexpensive photodetector. Single-pixel imaging has several qualities that allow it to create high-quality images with a single sensor at a considerable distance: precise alignment of the light onto the sensitive area of the detector, minimal losses from the optics, and knowledge of the predetermined patterns.
However, the real images reconstructed from single-pixel experiments under poor lighting conditions are affected by noise, blur, and speckles. Computer vision tasks such as image classification, segmentation, object detection, and tracking require high-quality images; hence, the reconstructed images are widely post-processed for visual quality improvement. The techniques used for such processing are generally classified into histogram-based and retinex-based algorithms [16,17]. In histogram-based algorithms, the contrast of images is enhanced by modifying the luminous intensity present in the real images. In retinex-based algorithms, the illumination and reflectance components of the real images are extracted, and the reflectance components are then processed for image quality enhancement. Images captured under natural light are often corrupted by environmental disturbances, camera limitations, and speckles. A recently developed low-light image enhancement method (LIEM) estimated the amount of illumination falling on each pixel and refined the illumination map for image quality enhancement [18].
Though the above-mentioned techniques capture transparent scenes, the resultant image quality is contaminated by surface and subsurface reflections as well as transmissions. Thus, researchers have recently focused on deep learning techniques to obtain quality-enhanced results. Super-resolution convolutional neural networks [19][20][21] were implemented to improve image quality, assessed with image quality assessment (IQA) techniques. Later, Zhang et al. [22] described the detection and quality improvement of floating objects such as garbage and plastic bottles on water surfaces based on an improved faster region-based convolutional neural network (FRCNN). To improve detection accuracy, higher-level and lower-level features in the datasets were integrated along with a gamma correction algorithm. However, relying on a single dataset limited the generalization of the work in categorizing floats on the water surface, and the gamma value for each image had to be determined before data analysis. Moreover, feature fusion and feature map extraction in each convolutional layer made the computation lengthy. Lai et al. [23] aimed to find the locations of transparent objects in color images.
Owing to the specular characteristics of the target object, computer vision methods have difficulty identifying such glossy objects. Hence, FRCNN networks were implemented in that work to recognize the object. The Translab algorithm developed for segmenting transparent objects from other real-world objects outperformed 20 other existing algorithms. The dataset used for training the deep neural network was real and contained 10,428 images. With the developed algorithm, robots could identify transparent objects in their path [24]. The difficulty in detecting transparent objects was overcome in that work by introducing a convolutional neural network, trained with images of both transparent and non-transparent objects together with annotation files describing the location of each object. The detection result was analyzed using a class score between 0 and 1, with the object identified as transparent when the score favored the transparent class. However, there were false detections when the shape of an opaque object was very similar to that of a transparent object [25]. The authors in [26] proposed a method to recover the depth and shape of a translucent object using a ToF camera. Owing to the complex light components from different parts of the target and the depth information from the camera, the resultant image quality was affected. Hence, a deep residual network was adopted, trained with images acquired from a Kinect sensor. Furthermore, the accuracy of the work was evaluated for various object poses, noise levels, and optical properties.
In this paper, we combine compressive sensing (CS) and SRCNN to reconstruct the transparent object. The designed optical transparent imaging system samples the object compressively and approximates the images based on the total-variation (TV) minimization algorithm. The reconstructed images are degraded due to speckles. Hence, the SRCNN algorithm is applied to the compressively reconstructed images for quality enhancement. Our results validate the accuracy and effectiveness of combining CS and SRCNN in producing good-quality images. The organization of the paper is as follows. Section 2 describes the theory behind the transparent object reconstruction and the proposed SRCNN network. In Section 3, the experimental setup is showcased followed by transparent object reconstruction results based on compressive sensing and SRCNN in Section 4. Section 5 includes the conclusions of this paper and future works.

Compressive sensing
In recent years, there has been a massive increase in the number of sensors employed and in the dimensionality of the data acquired. As a result, many applications are drowning in a flood of data, and strategies that effectively handle large data volumes have attracted considerable interest. To address this challenge, high-dimensional (Nyquist-rate) data should be compressed to low-dimensional models without losing the significant information in the signal or image. Most natural images can be approximated using very low-dimensional data by exploiting their sparsity; this is the core idea of CS [27,28]. In CS, an image can be well approximated when it is sparse on some basis. To recover such images, only the non-zero sparse coefficients need to be acquired, with minor information loss. For the acquisition process, pre-defined patterns that obey the restricted isometry property and incoherence are projected onto the object to be imaged, and the sampled linear measurements are collected.
The mathematical formulation of CS is as follows: in Fig. 1, consider the transparent object to be imaged as $X \in \mathbb{R}^n$, which can be compressively sensed as a measurement vector $y \in \mathbb{R}^m$ with $m < n$ [29]:

$$y = AX \qquad (1)$$

where $y$ represents the $m$ linear measurements taken using a single-pixel detector, and $A \in \mathbb{R}^{m \times n}$ and $X$ denote the pre-defined patterns and the object to be imaged, respectively. Equation (1) is illustrated in Fig. 1 and can be re-written as in (2):

$$y_i = \sum_{j=1}^{n} A_{ij} X_j, \quad i = 1, \ldots, m \qquad (2)$$

Each measurement row $A_i$ is random and is linearly combined with the transparent object $X$, which yields a system of equations. However, as mentioned above, $A$ has fewer rows than columns, so solving for $X$ from $y$ and $A$ alone is impossible because there are infinitely many solutions. To address this issue, the principle of CS is introduced, which is effective for sparse images. Sparsity refers to the representation of an image on an appropriate basis in which most of the coefficients are close to zero. When $X$ has $k$-sparse information, it can be recovered from $M = O(k \lg N)$ linear measurements [29] with the aid of a TV-minimization algorithm. The algorithm is expressed as

$$X^{*} = \arg\min_{X} \ \mathrm{TV}(X) \quad \text{subject to} \quad AX = y \qquad (3)$$

In the ideal case, the image can be obtained using (3). In practice, the measurement data are subject to interference from dark current in the detectors, oscillations of the light source, and electrostatic noise within the cables. Thus, (3) can be rewritten with an error factor $\varepsilon$ as

$$X^{*} = \arg\min_{X} \ \mathrm{TV}(X) \quad \text{subject to} \quad \lVert AX - y \rVert_2 \le \varepsilon \qquad (4)$$

With each iteration of the minimization problem, the accuracy of the reconstructed image $X^{*}$ is assessed by comparing the measured signal with the signal predicted from the known sampling patterns. For transparent items, the object's signals are weak and noisy; the TV-minimization technique is effective at eliminating such noise and reconstructing denoised images while preserving image features. The total variation of the image is given by

$$\mathrm{TV}(X) = \sum_{i,j} \sqrt{(X_{i+1,j} - X_{i,j})^2 + (X_{i,j+1} - X_{i,j})^2}$$

Thus, the reconstructed images obtained are of better quality.

Super-resolution convolutional neural network
Convolutional neural networks (CNNs) dominate computer vision research [30], and deep CNNs have gained popularity in recent years in areas such as image classification, face recognition, and object detection [31,32]. Several factors have played an important role in this progress: the availability of ample datasets for training network models, the discovery of efficient activation functions such as ReLU that make convergence faster, and the availability of cost-effective graphics processing units (GPUs). SRCNN has also been implemented to obtain high-resolution images under low-light conditions [33]. In that work, the high-frequency components of the low-resolution images are extracted so that they receive more attention during processing; color images are also investigated, and the network shows superior performance in achieving similarity between real and reconstructed images. Rather than simply extracting low-level or high-level features, all features are obtained and combined through dense skip connections [34]. Normally, computational time and cost increase when all features are fed to the deconvolutional layers; to avoid this, a bottleneck layer is used to reduce the number of features to 256. Deconvolutional layers then convert the low-resolution (LR) space to a high-resolution (HR) space, and a reconstruction layer recovers HR images from the HR space. Among the many image restoration networks, SRCNN is adopted in the current research because it converges faster and is well suited to low-resolution images under low-light conditions. The images formed by single-pixel imaging typically suffer from low light, blur, and noise, and our goal is to enhance their visual quality with the aid of SRCNN. The available image denoising algorithms work satisfactorily on raster-scanned images; however, they cannot be applied directly to single-pixel camera images.
Thus, the proposed SRCNN network is introduced for improving visual effects. The general framework of the SRCNN is shown in Fig. 2.

Fig. 2 General framework of the SRCNN network.
The basic SRCNN architecture mainly consists of the following stages: image scaling, feature extraction, non-linear mapping, and reconstruction. Normally, single-pixel images must first be rescaled with bicubic interpolation; a large set of bicubic-interpolated images must then be reduced in size to speed up computation. In the feature extraction stage, the dimensionality of the input image is reduced while keeping the vital features that describe the original input. This convolution step is performed using the formula

$$F_1(X) = f(\omega_1 * X + B_1)$$

where $X$ is the low-resolution single-pixel image, $\omega_1$ is the convolution kernel, $B_1$ is the bias, $f$ is the activation function, and $*$ denotes the convolution operator. The activation function used in this work is ReLU, which outputs the input when it is positive and zero otherwise:

$$f(x) = \max(0, x)$$

In the non-linear mapping stage, the extracted low-dimensional features are mapped to high-dimensional features:

$$F_2(X) = f(\omega_2 * F_1(X) + B_2)$$

where $F_1(X)$ is the output of the previous layer, $\omega_2$ is the kernel for the non-linear mapping stage, and $B_2$ is the bias. The last layer in the network is image reconstruction, where the visually enhanced image is recovered using

$$F(X) = \omega_3 * F_2(X) + B_3$$

where $F(X)$ is the reconstructed visually enhanced image, $\omega_3$ is the kernel for the image reconstruction layer, $F_2(X)$ is the output of the non-linear mapping layer, and $B_3$ is the bias. Both simulations and experiments on compressed single-pixel images show that the images reconstructed using the SRCNN network are of good quality.
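The three-stage pipeline described by these formulas can be sketched as a plain numpy forward pass. This is a single-channel, single-filter illustration with randomly initialized kernels standing in for trained weights; the 9/1/5 kernel sizes follow the common SRCNN configuration and are an assumption, not this paper's trained network.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(img, kernel):
    """2-D cross-correlation (as used in CNNs) with zero padding so the
    output has the same size as the input."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(img, pad)
    windows = sliding_window_view(padded, (k, k))
    return np.einsum('ijkl,kl->ij', windows, kernel)

def relu(x):
    return np.maximum(x, 0.0)            # f(x) = max(0, x)

rng = np.random.default_rng(1)
X = rng.random((40, 40))                 # low-resolution single-pixel image
w1, b1 = rng.standard_normal((9, 9)) * 0.1, 0.0   # feature extraction kernel
w2, b2 = rng.standard_normal((1, 1)) * 0.1, 0.0   # non-linear mapping kernel
w3, b3 = rng.standard_normal((5, 5)) * 0.1, 0.0   # reconstruction kernel

F1 = relu(conv2d_same(X, w1) + b1)       # F1(X) = max(0, w1 * X + B1)
F2 = relu(conv2d_same(F1, w2) + b2)      # F2(X) = max(0, w2 * F1(X) + B2)
F  = conv2d_same(F2, w3) + b3            # F(X)  = w3 * F2(X) + B3
```

Each stage preserves the spatial size of the image, so the final layer outputs a map that can be compared pixel-by-pixel with the ground truth during training.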

Transparent object detection system setup
The schematic illustration of the transparent object inspection system is shown in Fig. 3, and the corresponding lab setup is shown in Fig. 4. In the experiments, a green laser fixed on an optical table serves as the light source that illuminates the micromirrors of the digital micromirror device (DMD). The DMD (Texas Instruments DLP6500), with over 2 million micromirrors, acts as a projector for computer-generated, pre-programmed patterns. To illuminate the micromirrors in the DMD, the light transmitted from the object is collected and focused onto the illumination area of the device. As the DMD is connected to the computer, the generated patterns are loaded into the DMD and projected whenever an illumination light source and an input trigger signal are present. The projected pattern from the DMD combines with the object information and enters a focusing lens. The focused light then passes through a beam splitter, where the beam splits into two: one part goes to the single-pixel detector (SPD) (Thorlabs PDA36A2) through a series of focusing lenses (LA1740, f = 85 mm), and the other goes to a light absorber. The beam splitter is introduced to reduce the light intensity to a manageable level for the single-pixel detector; otherwise, the detector saturates at its maximum light level. The detected light intensity is converted to a voltage at the detector and sent to a data acquisition device (DAQ USB600). The DAQ digitizes the measurements and delivers them to a processing device for analyzing the raw data. To obtain 3D images, the object should be placed on a rotating platform so that images can be captured from different perspectives. When the micromirrors are illuminated with pre-programmed patterns of white and black pixels, the mirrors corresponding to white pixels tilt to +12° and direct light onto the object for sampling.
On the other hand, the mirrors impinged with black pixels tilt to −12° and redirect light to a light absorber in the DMD. The DMD thus acts as a projector displaying highly spatially resolved pre-programmed patterns for sampling the object: it reflects the required light and, at the same time, diverts the undesired light to a light absorber to realize the projection of the image. The light is steered electrostatically by controlling the tilt angle of the micromirrors. To make the system compressive, the object is sampled with patterns; the micromirrors in the DMD randomize the sampling by projecting light onto the object from the "on" pixels. The light transmitted from the object after sampling is collected by a focusing lens and directed to the single-pixel detector. The alignment and specifications of the lenses and other optical components in a single-pixel imaging system play a vital role in ensuring good-quality imaging: to obtain better results, the optics and other devices should be selected carefully so that the transmitted light is directed onto the sensitive area of the sensor with minimal aberrations. The voltage measured by the single-pixel detector integrates the product of the target object and the displayed pattern, giving one element of the measurement vector for each pattern. An analog-to-digital converter then digitizes the measurements for further processing. The quality of the final reconstructed images is determined by the patterns chosen; over the years, various patterns have emerged, including random, chaotic, Hadamard, speckle, and Fourier patterns.
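The measurement process described above, in which the detector voltage integrates the product of the scene and each displayed pattern, can be sketched as follows. Random binary patterns stand in for the DMD mirror states, and the scene is a synthetic array rather than experimental data.

```python
import numpy as np

rng = np.random.default_rng(2)

H = W = 32
scene = np.zeros((H, W))
scene[8:24, 12:20] = 1.0             # synthetic object silhouette

m = 256                               # number of patterns (m < H*W: compressive)
patterns = rng.integers(0, 2, size=(m, H, W)).astype(float)  # "on"/"off" mirrors

# Each single-pixel measurement is the total light reaching the detector:
# the element-wise product of pattern and scene, summed over all pixels.
y = np.array([(p * scene).sum() for p in patterns])
```

Here `m = 256` measurements sample a 1024-pixel scene, so the measurement vector `y` is one quarter the size of the image it encodes; recovery then relies on the reconstruction algorithm and knowledge of `patterns`.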

Transparent object reconstruction based on compressive sensing and SRCNN
The developed SRCNN network takes compressively reconstructed images as input, attempts to learn latent features, and then reconstructs the distorted image as an output with enhanced features. In our study, an image data generator is used to artificially create the required dataset from the real and reconstructed images; the expanded dataset, with its variations on the reconstructed images, results in a more skillful network model. For training, the dataset is split in two: one part for training and the other for validation. Because the dimensions of the images would cause intense computation and long processing times, the images are down-sampled to a smaller size. The training set is then used to train the SRCNN network, and the validation set is used to evaluate the accuracy of the model.
As a pre-processing step, the images are rescaled to 40 percent of their original size using bicubic interpolation. Rectified linear unit (ReLU) activation and max-pooling operations are then performed on the interpolated images in the convolution layers, followed by predictions of the model accuracy. The algorithm uses a pre-trained model. A series of convolution layers in the SRCNN network offers better learning and generalization of the data. A total of 14 convolution layers, along with up-sampling and down-sampling operations, are included in the model to increase performance; multiple layers enable the network to understand more detailed and abstract relationships within the data, as well as how the features interact with each other. The framework of the proposed CS-SRCNN network is shown in Fig. 5. Necessary library functions such as numpy, cv2, tensorflow, matplotlib, sklearn, and pickle are imported for processing the images. In our study, we established a dataset containing 500 transparent object images of size 128×128. Initially, 10 camera images of transparent objects and an equal number of single-pixel imaging (SPI) images are obtained. Data augmentation including shifting, flipping, rotation, and brightness adjustment is then performed; after augmentation, the training dataset contains a total of 500 images, comprising 400 camera images and 100 single-pixel images. The images in the dataset are rescaled to a fixed size of (80, 80, 3) and normalized. The dataset is then randomly divided 80:10:10 into training, validation, and test sets, respectively. With both the camera images and the SPI images included in the dataset, the network is able to uncover more hidden features and reconstruct enhanced output images that resemble the original image.
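The augmentation and 80:10:10 split described above can be sketched as follows. The arrays here are random stand-ins for the 20 original images, and the specific augmentation parameters (shift range, brightness range) are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for the 10 camera images and 10 SPI images (80x80 RGB after rescaling).
images = rng.random((20, 80, 80, 3))

def augment(img, rng):
    """Simple augmentations: horizontal flip, circular shift, brightness scale."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                                # horizontal flip
    img = np.roll(img, rng.integers(-5, 6), axis=0)       # vertical shift
    img = np.clip(img * rng.uniform(0.7, 1.3), 0.0, 1.0)  # brightness change
    return img

# Expand the 20 originals to 500 images (originals + 480 augmented copies).
dataset = list(images) + [augment(images[i % 20], rng) for i in range(480)]
dataset = np.stack(dataset)

# Shuffle, then split 80:10:10 into train / validation / test sets.
idx = rng.permutation(len(dataset))
n_train, n_val = int(0.8 * len(dataset)), int(0.1 * len(dataset))
train = dataset[idx[:n_train]]
val   = dataset[idx[n_train:n_train + n_val]]
test  = dataset[idx[n_train + n_val:]]
```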
Hence, in the training process, the camera images are rescaled to 40% of their original size, the same size as the low-resolution images produced during the CS process; after the network is well trained, the actual inputs will be the low-resolution images from the CS process directly. A customized function then lowers the resolution of the images without reducing their size. In a traditional SRCNN network, low-resolution images are turned into high-resolution images via feature extraction, non-linear mapping, and reconstruction phases: the low-resolution components are first upsampled to a suitable resolution using bicubic interpolation, and convolution layers then improve the image quality. Compared with the traditional SRCNN network, the proposed network contains ten convolution blocks with several max-pooling, up-sampling, and down-sampling operations in between. Each convolution block combines convolution, ReLU activation, and max-pooling layers. Because down-sampling distorts the training set for the SRCNN model, the network is trained to reintroduce the lost information in the output with much better clarity. For both the camera and the single-pixel transparent object images, the network is trained for 100 epochs. By keeping positive values fixed and using ReLU as the activation function, rapid changes in the image can be detected. The filter size for each layer remains constant (3×3), but the number of filters used to generate different feature maps varies, as shown in Fig. 6. A kernel regularizer with a value ranging from 10⁻³ to 10⁻¹⁰ is also supplied. Changes in image quality were observed under varying learning rates; switching to the Adam optimizer with a learning rate of 10⁻¹⁰ in all layers improves performance.
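The customized resolution-lowering function mentioned above is not specified in detail in the paper; one plausible sketch is block-averaging followed by nearest-neighbor upsampling, which degrades detail while preserving the array size.

```python
import numpy as np

def lower_resolution(img, factor=2):
    """Reduce the effective resolution by `factor` while keeping the image size.

    The image is block-averaged (down-sampling) and each block value is then
    repeated (nearest-neighbor up-sampling) back to the original dimensions.
    Height and width must be divisible by `factor`.
    """
    h, w = img.shape
    small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

img = np.arange(64, dtype=float).reshape(8, 8)
low = lower_resolution(img, factor=2)
```

The output array has the same shape as the input, but its pixel values are constant within each block, mimicking a low-resolution input at the original size.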
The mean square error (MSE) loss function is used to calculate the difference between the SRCNN output and the ground-truth single-pixel image:

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \lVert F(Y_i; \theta) - X_i \rVert^2$$

where $n$ is the number of training samples and $X_i$ is the original image. The HR image $F(Y_i; \theta)$ is obtained by taking the LR image $Y_i$ as input and passing it through the mapping function $F(\cdot)$. The model is optimized using the Adam update rules

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $g_t$ represents the gradient, and $m_t$ and $v_t$ are the momentum and variance estimates, which are biased towards zero and therefore bias-corrected. The decay values $\beta_1$ and $\beta_2$ are close to one. The weights of the network model are initialized randomly and updated through backpropagation. The result obtained is good, with a validation loss of 0.0028 and an accuracy of 91%. We employ early stopping during training to halt when a monitored metric has stopped improving; this minimizes the loss in the case of too many epochs, which leads to overfitting of the training dataset, while too few epochs result in an underfit model. The predicted images are then displayed alongside the training images for similarity testing; the obtained results are shown in Fig. 7. The network model is implemented in tensorflow, the image augmentation is done using the Keras deep learning library, and the network is trained on the Google Colaboratory platform.
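The MSE loss and the Adam update rules above can be written out concretely. The following minimizes the MSE of a toy one-parameter model with hand-coded Adam updates; the learning rate and decay values here are the optimizer's common defaults, not the values used to train the network.

```python
import numpy as np

# Toy data: the model predicts y = theta * x; the true theta is 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

def mse(theta):
    # L(theta) = (1/n) * sum ||F(x_i; theta) - y_i||^2
    return np.mean((theta * x - y) ** 2)

def grad(theta):
    # Gradient of the MSE loss with respect to theta.
    return np.mean(2 * (theta * x - y) * x)

# Adam: m and v are first- and second-moment estimates, bias-corrected
# because they are initialized at zero.
theta, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
```

After the loop, `theta` sits close to the true value of 3, and the MSE loss has dropped far below its starting value.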

In the sections above, the single-pixel imaging theory and the experimental setup are described in detail. To image transparent objects with good quality, the single-pixel imaging technique and the CS-SRCNN network are combined in this work. For image formation, experiments are conducted in a controlled lab environment to recover a homogeneous transparent object with a smooth surface. For imaging transparent scenes, the DMD modulates the intensity-transformed scene information transmitted from the object with the patterns. Owing to their high fidelity, imaging speed, compressibility, edge preservation, and ability to provide large spatial information about the scene, Hadamard patterns are utilized in the imaging system. The images reconstructed with this pattern for different transparent objects are shown in Fig. 7.
The quality of the images reconstructed using compressive sensing alone and using our proposed approach is investigated. Image quality analysis is performed using the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [35]. The MSE is calculated as

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (X_i - Y_i)^2$$

where $X$ is the SPI image, $Y$ is the SRCNN-reconstructed output, and $N$ is the total number of pixels. PSNR, on the other hand, is defined as

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}^2}{\mathrm{MSE}} \right)$$

where MAX is the maximum possible pixel value. SSIM is a distinctive method for determining how similar two recorded images are: it is a metric that quantifies the deterioration of image characteristics due to processes such as compression and data deletion. The SSIM between $X$ and $Y$ is calculated as

$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + c_1)(2\sigma_{XY} + c_2)}{(\mu_X^2 + \mu_Y^2 + c_1)(\sigma_X^2 + \sigma_Y^2 + c_2)}$$

with $c_1 = (k_1 L)^2$, $c_2 = (k_2 L)^2$, $k_1 = 0.01$, $k_2 = 0.03$, and $L$ the range of pixel values. Table 1 indicates the image quality for each object. Before the application of the SRCNN algorithm, the 2D images from the experimental system are reconstructed with knowledge of the patterns. Though the system can recover images of transparent objects, their visual quality suffers from speckles and other noise. The compressively reconstructed images are then used to train the network to improve the image quality. The results indicate that the combination of the CS and SRCNN algorithms improves the reconstructed image quality. Furthermore, the application of compressive sensing reduces the number of measurements required for 2D image reconstruction, along with the number of sensors. Overall, the combination of a single-pixel transparent object inspection system with CS-SRCNN works very well in capturing and recovering good-quality images.
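The quality metrics above can be sketched in a few lines of numpy. This uses a single global window for SSIM rather than the sliding-window variant of [35], with the standard constants $k_1 = 0.01$ and $k_2 = 0.03$; the images are synthetic stand-ins for the SPI and SRCNN outputs.

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((x - y) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, L=1.0, k1=0.01, k2=0.03):
    """SSIM computed over a single global window covering the whole image."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(4)
clean = rng.random((64, 64))
noisy = np.clip(clean + rng.normal(0, 0.1, clean.shape), 0.0, 1.0)
```

An image compared with itself yields an SSIM of 1, while added noise pushes the score below 1 and gives a finite PSNR, which is how the improvement from 0.2 to 0.53 SSIM in Table 1 is quantified.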

Conclusions
In the current research, a transparent object reconstruction method based on CS and SRCNN is proposed that can detect and recover homogeneous transparent objects. The CS result from the SPI system has been used for post-processing of the images during network training. The resultant images obtained from the CS-SPI system are good in quality and preserve the edges of the transparent objects, which is a limitation of conventional imaging systems. The SRCNN algorithm then produces visually enhanced HR images after being trained with augmented images and experimental results. Image augmentation plays a crucial role in this work in expanding the dataset in the absence of a sufficient number of training images. The results indicate the efficiency of the system: the SSIM and PSNR values of the SRCNN-reconstructed images are improved to 0.53 and 15, respectively. This means that applying the CS-SRCNN network to single-pixel transparent object images improves the visual quality of the recovered images to a considerable extent. However, the edges of the SRCNN-reconstructed images are uneven, which will be addressed in future work along with an improved 3D reconstruction algorithm.