Introduction

Vision plays the most important role in information acquisition [1], and the camera, the most important means of acquiring visual information besides the human eye, is essential. Camera researchers face the significant challenge of how to achieve high-performance imaging effectively [2,3,4,5,6,7,8,9,10,11,12,13,14,15], including wide-field high-resolution imaging [2, 5], high frame-rate imaging [10], and high dynamic range imaging [12]. A significant strategy is to build a bridge between parallel cameras and wide-field-of-view (FOV) high-resolution imaging. Unfortunately, natural and artificial compound eyes [16,17,18,19,20,21,22,23,24] suffer from a short line of sight and a small numerical aperture, which result in low spatial resolution. A study has also shown that if the spatial resolution of a compound eye were raised to the level of the human eye, the radius of the whole lens would have to be at least 1 m [25]. Fortunately, for real-world scene-reconstruction tasks, array cameras [26,27,28,29,30] pave the way for smarter and more advanced imaging. Pan-and-scan panoramic techniques were initially used for wide-FOV imaging [31], but extending this method is feasible only at extremely low frame rates (e.g., GigaPan Time Machine [32]). As a typical example, the LSST [33] has a single optical lens but uses 189 scientific sensors to capture a 3.2-gigapixel image. Benefiting from multiscale design, the AWARE-2 [2], AWARE 10 [34] and AWARE 40 [3] cameras have driven a transition from small-scale to large-scale spatial sampling; AWARE-2, for example, uses 98 cameras to improve data throughput and spatial resolution at three frames per minute. Moreover, the improved RUSH [5] with 35 CMOS sensors and the modular hierarchical array camera [28] with 20 cameras are no longer limited by a large overlapping FOV. Researchers at Stanford University [10] have achieved remarkable cost control by building their system from inexpensive cameras, although four large PC platforms are still required to operate simultaneously. More recently, the Mantis camera [4] with 18 cameras has reduced system complexity, but a relatively large and expensive electronic system is still required. In short, existing systems still follow the principle of digital zoom with high pixel count and high cost.

Computational imaging may transform the central challenge of photography from the question of where to point the camera to that of how to achieve higher-performance imaging. If a feasible solution to the above problems exists, optical zoom is an inexpensive and convenient candidate. Understanding the direct transition from high-pixel-count digital zoom to optical zoom in parallel cameras has been a long-standing challenge of great scientific and practical importance. Optical zoom, which magnifies details without changing the back working distance, is highly desirable for improving imaging capability. Existing optical zoom systems [35, 36] usually rely on the mechanical movement of multiple solid optical elements to magnify high-resolution details, with zoom times of a few seconds. Adaptive lenses, such as elastomeric membrane lenses [37, 38], electrowetting lenses [39,40,41], and liquid crystal lenses [42,43,44], can be used to build optofluidic zoom systems [45,46,47,48].
However, the above-mentioned systems require an extended axial dimension as well as complex driving systems. Moreover, existing zoom systems can only magnify the central area of the FOV and are incapable of magnifying details in the marginal FOV. A key problem is therefore how to make optical zoom in the marginal FOV possible for parallel cameras, and how to do so in a convenient and effective way. Such a system, to the best of our knowledge, has never been achieved.

Here we propose a deep learning-based parallel camera with 4× computational zoom that learns optical zoom, with an 8-μrad instantaneous FOV (IFOV) and a 33-ms zoom time, which uses 6 cameras to capture snapshot 30-megapixel (MP) images at 30 frames per second (fps). In this study, we abandon the high-pixel-count mode that relies on a large number of subarrays and replace it with an economical deep-learning model, which has competitive advantages over existing zoom systems. The remaining challenge is how an array camera can realize zoom over any local area of the whole stitched FOV, especially the marginal FOV of each camera; to our knowledge, no existing array camera meets this standard and makes optical zoom in the marginal FOV possible. Hence, we present an end-to-end model that learns the ideal mapping from short-focus imaging to long-focus imaging over the stitched FOV, which dramatically reduces the number and complexity of subarrays. Benefiting from deep learning, both the array camera itself and the electronic computing equipment can be simplified. Our system achieves a ~100× improvement in zoom time compared with conventional systems, independent of any additional optical components: traditional zoom systems usually take a few seconds to zoom, whereas ours takes only ~33 ms.

Results and discussion

Principle and concept

The concept of the deep learning-based parallel (DLBP) camera is inspired by the mantis shrimp compound eye and the zoom camera. The DLBP camera provides an approach that makes real-time computational zoom possible over any area of the stitched FOV, especially at the edge of the FOV, which is unavoidably sacrificed in conventional zoom systems.

In nature, insect compound eyes are composed of neatly arranged ommatidia, which is of great significance for a larger stitched FOV, as illustrated in Fig. 1a. However, existing compound-eye imaging systems have a fixed focal length and a low numerical aperture, resulting in low resolution (LR). By contrast, the zoom principle of a camera improves the resolution (Fig. 1b, c). To the best of our knowledge, however, no reported array camera combines a stitched FOV with optical zoom. Figure 1d illustrates the functions of the proposed DLBP camera, and the real-time stitched FOV is defined as follows:

$$\mathrm{FOV}_{t_{i}}=N\times \mathrm{FOVs}_{t_{i}}$$
(1)

where N denotes the number of cameras, FOVs denotes the FOV of a single camera, and t_i indexes each frame of the stitched time series.
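As a simple consistency check, assuming negligible overlap between adjacent sub-FOVs (as required by the stitching strategy described below), the ~150° panorama reported later with N = 6 cameras corresponds to a per-camera sub-FOV of roughly 25°:

$$\mathrm{FOV}_{t_{i}}=6\times 25^{\circ}\approx 150^{\circ}$$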

Fig. 1

Concept and principle of the DLBP camera. a Compound eye of a mantis shrimp, with stitched ommatidia. b Zoom lens. c Schematic of the zoom lens. d Schematic of real-time imaging with parallel transfer, image stitching, and computational zoom in any area. e Overall architecture of the end-to-end model. Black arrows indicate convolution operations, red arrows indicate coordinate-attention operations, and the green arrow denotes the deconvolution operation; ☉ denotes the multiplication operation

Inspired by the principle of zoom lenses, deep learning enables the DLBP camera to calculate the mapping from short-focus to long-focus imaging. As shown in Fig. 1d, the DLBP camera divides the scene into multiple sub-FOVs, each covering part of the scene information. The stitched movie denotes the real-time stitched wide-FOV image. The pretrained model runs on an interactive platform, in which computation over the stitched FOV replaces driver-based mechanical deflection, something conventional zoom systems cannot achieve. A tunable focal length (F1-F2) is obtained using the deep-learning model that learns optical zoom, which is advantageous in both zoom responsiveness and spatial resolution.

Figure 1e illustrates the overall architecture of the pretrained model in Fig. 1d, including feature extraction, shrinking, non-linear mapping, expanding, coordinate attention, and a deconvolution operation, where m = 4 in the non-linear mapping layer. The Parametric Rectified Linear Unit (PReLU) is selected as the activation function and the Mean Squared Error (MSE) as the loss function. A coordinate attention (CA) mechanism is applied to strengthen attention to feature information, which improves imaging performance while preserving real-time super-resolution of real-world photography. PReLU is used to compute the output of each layer, helping to prevent overfitting. Real-world super-resolution imaging places high demands on the practicality of the network. Given a set of wide-FOV images {I_real^WF} and ground-truth (GT) images {I_real^T}, we obtain the corresponding low-resolution images {I_real^L}, and the optimization objective is:

$$\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\left\| F\!\left(I_{real}^{L};\theta\right)-I_{real}^{T}\right\|_{2}^{2}=\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\left\| I_{real}^{S}-I_{real}^{T}\right\|_{2}^{2}$$
(2)

where I_real^L and I_real^T are the i-th LR and GT image pair, F(I_real^L; θ) is the output of the network ψ(·) for I_real^L with parameters θ, and I_real^S is the super-resolved image. All parameters are optimized with the chosen optimization function. More details are given in Appendix 1.
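The exact layer hyperparameters are listed in Supplementary Table S1 and are not reproduced here. Purely as an illustrative sketch, with placeholder channel widths (d, s), kernel sizes, a simplified generic coordinate-attention block, and a single-channel input assumed, the following PyTorch code shows how an architecture of this kind (feature extraction, shrinking, m = 4 mapping layers, expanding, coordinate attention, deconvolution) could be assembled and trained against the MSE objective of Eq. (2):

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Simplified coordinate-attention block (reduction ratio r is a placeholder)."""
    def __init__(self, channels, r=8):
        super().__init__()
        mid = max(channels // r, 4)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.act = nn.PReLU(mid)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Encode global context along each spatial direction.
        pooled_h = x.mean(dim=3, keepdim=True)                       # N x C x H x 1
        pooled_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # N x C x W x 1
        y = self.act(self.conv1(torch.cat([pooled_h, pooled_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                        # N x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))    # N x C x 1 x W
        return x * a_h * a_w                                         # element-wise "multiplication"

class DLBPSRNet(nn.Module):
    """FSRCNN-style super-resolution backbone with coordinate attention (illustrative only)."""
    def __init__(self, scale=4, d=56, s=12, m=4):
        super().__init__()
        layers = [nn.Conv2d(1, d, 5, padding=2), nn.PReLU(d),        # feature extraction
                  nn.Conv2d(d, s, 1), nn.PReLU(s)]                   # shrinking
        for _ in range(m):                                           # non-linear mapping (m = 4)
            layers += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s)]
        layers += [nn.Conv2d(s, d, 1), nn.PReLU(d)]                  # expanding
        self.body = nn.Sequential(*layers)
        self.attn = CoordAttention(d)
        self.deconv = nn.ConvTranspose2d(d, 1, 9, stride=scale,      # 4x upsampling
                                         padding=4, output_padding=scale - 1)

    def forward(self, x):
        return self.deconv(self.attn(self.body(x)))

# One training step with the MSE objective of Eq. (2), minimized over parameters theta.
model, criterion = DLBPSRNet(), nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lr_batch = torch.rand(2, 1, 64, 64)      # stand-in I_real^L
gt_batch = torch.rand(2, 1, 256, 256)    # stand-in I_real^T
optimizer.zero_grad()
loss = criterion(model(lr_batch), gt_batch)
loss.backward()
optimizer.step()
```

In this sketch the deconvolution layer performs the 4× upsampling, and the attention weights are applied by element-wise multiplication, matching the ☉ operation in Fig. 1e.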

The model parameters of each layer are described in Supplementary Table S1. The parameters of the conv and deconv layers are: k, the filter size; c, the number of channels; s, the stride; and p, the padding. Similarly, r in the attention layer represents the zoom ratio. Our experiments are performed on a PC platform (Intel Core i5-8600K CPU @ 3.6 GHz + GTX 1070) running the Windows 10 operating system.

Developed system

The DLBP camera is a camera array that is highly scalable in size, weight, power and cost. As illustrated in Fig. 2a, the DLBP camera is mounted in a 0.4 m × 0.4 m × 0.15 m frame and comprises 6 cameras and gimbals. Each camera is fixed on a voltage-driven gimbal, and the angle of each camera can be adjusted to maximize the degrees of freedom. The DLBP camera body is connected to the Peripheral Component Interconnect Express (PCIe) interface of the host using gigabit network cables, and each camera is equipped with a Sony 335 CMOS sensor with a 2-μm pixel pitch. Here the PC and the PCIe interface are responsible for computing and data transfer, respectively. The cameras share a local area network (LAN) for communication, and the stitched result is displayed with < 300 ms latency. The resulting DLBP camera has an 8-μrad IFOV and 4× computational zoom, and uses 6 cameras to capture snapshot 30-MP images at 30 fps.
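As a minimal sketch of the parallel-transfer side, the snippet below reads several network video streams concurrently, one thread per camera, and keeps the latest decoded frame of each channel available for stitching. The stream URLs and the use of OpenCV's VideoCapture are illustrative assumptions; the actual transfer interface of the DLBP camera is not detailed here.

```python
import threading

import cv2  # OpenCV; assumes each camera exposes a standard network video stream

# Hypothetical stream addresses for the 6 cameras (placeholders, not from the paper).
STREAM_URLS = [f"rtsp://192.168.1.{10 + i}/stream" for i in range(6)]

latest_frames = [None] * len(STREAM_URLS)          # most recent frame per camera
locks = [threading.Lock() for _ in STREAM_URLS]

def capture_loop(cam_idx, url):
    """Continuously pull frames from one camera so every channel stays current."""
    cap = cv2.VideoCapture(url)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        with locks[cam_idx]:
            latest_frames[cam_idx] = frame
    cap.release()

threads = [threading.Thread(target=capture_loop, args=(i, url), daemon=True)
           for i, url in enumerate(STREAM_URLS)]
for t in threads:
    t.start()
```

Decoupling per-camera decoding from stitching in this way keeps a slow channel from stalling the others.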

Fig. 2

Developed system. The developed system includes the system body, the transfer module and the computing module

Image formation pipeline

The image formation pipeline refers to obtaining a computational-zoom result from the stitched FOV. It encompasses three components: parallel transfer/stitching, smart monitoring, and computational zoom in any area of the stitched FOV. Benefiting from the independence of the cameras, we eliminate the overlapping-FOV requirement of conventional systems (e.g., ~30% in AWARE-2), and the per-camera computations can run independently, which improves flexibility. Additionally, stitching robustness is no longer restricted by texture information, because the stitching pipeline depends only on the pixel positions and the camera positions, as sketched below. Most importantly, the FOV saved in this way can be devoted to covering richer information, which dramatically reduces hardware cost and simplifies the system. More details are given in Appendix 2.
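Because the composite depends only on where each camera's pixels land in the stitched frame, and not on the image content, the stitching step reduces to copying each sub-image into a precomputed slot. The following sketch illustrates the idea with a hypothetical 2 × 3 layout and an assumed 1944 × 2592 sensor resolution; the real offsets would come from the calibrated camera positions, which are not given here.

```python
import numpy as np

def stitch_by_position(frames, offsets, canvas_shape):
    """Place each camera frame at its precomputed (row, col) offset.

    No feature matching or overlap is needed: the mapping is fixed by
    the pixel positions and the calibrated camera positions.
    """
    canvas = np.zeros(canvas_shape, dtype=frames[0].dtype)
    for frame, (r0, c0) in zip(frames, offsets):
        h, w = frame.shape[:2]
        canvas[r0:r0 + h, c0:c0 + w] = frame
    return canvas

# Illustrative 2 x 3 layout of six 1944 x 2592 sub-images (layout and resolution assumed).
sub_h, sub_w = 1944, 2592
offsets = [(r * sub_h, c * sub_w) for r in range(2) for c in range(3)]
frames = [np.zeros((sub_h, sub_w, 3), dtype=np.uint8) for _ in range(6)]
panorama = stitch_by_position(frames, offsets, (2 * sub_h, 3 * sub_w, 3))
```

Since each slot is disjoint, no blending, feature matching or overlap handling is required.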

The DLBP camera produces 30-MP images coded in H.265/H.264 format with a 1–36 Mbps bitstream. As illustrated in Fig. 3, an example scene was captured using the DLBP camera; the stitched frames (a-c) at 5 s, 8 s and 15 s are composed of sub-images captured by the 6 narrow-field cameras. The extracted insets (d-f) show details at the seam position; note that the body of a football player is divided into two parts by the red line, which is exactly the seam. The snapshot insets demonstrate that the transfer of all cameras is completely synchronized because parallel transfer is achieved. Additionally, the experimental results (g-i) demonstrate that continuous stitching is realized. The red lines in the stitched frames denote the stitching seams between adjacent mosaic images. Supplementary Movie S1 shows an example of the parallel transfer.

Fig. 3

Parallel-transfer and stitched-FOV frames captured using the DLBP camera. a-c Stitched frames at different instants (5 s, 8 s and 15 s), each stitched from 6 sub-images, demonstrating that the camera overcomes the challenge of synchronous multi-channel transfer. d-f Labelled regions from the stitched frames, showing that each channel is synchronized. g-i Labelled regions from the stitched frames, showing that large-FOV stitching is realized. The red dotted lines mark the positions of the stitching seams

Panorama with computational zoom and super resolution

As shown in Fig. 4a, the panorama sample was captured at Taikoo Li Square, Chengdu, covering distances of ~0.3–4 km. The stitched frame depicts a panoramic view of downtown Chengdu, from which local super-resolution details can be observed in real time through the pre-trained model. The challenge of photography is thus transformed from the question of where to point the camera into that of how to achieve high-performance computational zoom. The advantage of the DLBP camera, which abandons the high-pixel-count, high-cost pattern, is illustrated in Fig. 4b-j. The interactive example removes the constraints of mechanical movement and driving, as well as the inability to optically zoom at the edge of the FOV, providing 4× computational zoom at 30 fps. Figure 4b-j depicts the super-resolution results of labelled regions at distances of 350 m to 4000 m, which are not readable without computational zoom and super-resolution (Fig. 4e and h).

Fig. 4

Interactive panorama example captured using the DLBP camera. a Stitched panorama, stitched from 6 sub-images. b-d SR reconstruction images with 4× computational zoom. e Labelled region from the panorama. f-g SR reconstruction images with 4× computational zoom, which recover the rich information from short-focus to long-focus imaging. h Labelled region from the panorama. i SR reconstruction image with 4× computational zoom. j Sky-eye satellite map. k Comparison of our system with conventional systems.

The panorama is captured using the DLBP camera at 30 MP and 30 fps; it is stitched from 6 sub-images and covers a FOV of about 150°. Figure 4b provides the hotel information, magnifying details of the hotel's exterior. Figure 4c shows details of the periphery of a shopping mall, where the number of fences on the roof (23) can be easily distinguished. The experimental result in Fig. 4d shows that when the test distance exceeds 4 km, the performance of the model is greatly reduced. Figure 4f provides clear evidence of the advantages of computational zoom, whereas mosaicking and blur are inevitable if the image is directly digitally magnified by M times (Fig. 4e). An example HDR image is shown in Fig. 4g; the brightness of this scene ranges from fully sunlit building regions to deeply shadowed street areas. In contrast with computational zoom, digital zoom yields the distorted mosaic image in Fig. 4h. Figure 4i provides accurate information, including the number of zebra crossings on the road (27). Supplementary Movie S2 is an interactive visible-light example captured from the interface; in it, our computational zoom takes only ~33 ms, whereas traditional zoom systems usually take a few seconds.

We assembled the DLBP camera on the top floor (left) of a building (200 m) to view the street in real time. The sky-eye satellite map (Fig. 4j) shows that the distance is about 300-400 m; the scale is estimated from the satellite map. Figure 4k shows that our strategy has competitive advantages in covered information (FOV × resolution), zoom speed and zoom capability. The super-resolution imaging advantage of the DLBP camera in infrared light is also confirmed in Appendix 3. Supplementary Movie S3 is an interactive infrared example captured from the interface.

Methods

Image formation strategy

Array cameras using conventional image-stitching algorithms are limited by the overlapping FOV (~30% in AWARE-2), for which complex registration is one of the key challenges. Furthermore, feature-point-based stitching methods do not work well in areas with weak texture, and they come at a great cost in time. To overcome these challenges, we explore a real-time image-formation strategy that implements the mapping of input pixels to composite pixels, as sketched below. Parallel computing is handled by the CUDA interface [49].
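The paper handles this pixel-to-pixel mapping through the CUDA interface [49]; as a library-level stand-in, the sketch below expresses the same idea, a precomputed lookup table from composite pixels back to (camera, row, column) sources, as a single parallel gather on the GPU using PyTorch tensors. The table is filled randomly here purely for illustration; in practice it would come from calibration.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Six sub-images (values are placeholders) stacked as one tensor: (N, H, W, 3).
n_cams, sub_h, sub_w = 6, 1944, 2592
subs = torch.randint(0, 256, (n_cams, sub_h, sub_w, 3),
                     dtype=torch.uint8, device=device)

# Precomputed lookup table: for every composite pixel, which camera and which
# source pixel it comes from (random here; produced by calibration in practice).
comp_h, comp_w = 3888, 7776
cam_idx = torch.randint(0, n_cams, (comp_h, comp_w), device=device)
src_row = torch.randint(0, sub_h, (comp_h, comp_w), device=device)
src_col = torch.randint(0, sub_w, (comp_h, comp_w), device=device)

# One parallel gather builds the whole composite; every output pixel is independent.
composite = subs[cam_idx, src_row, src_col]          # (comp_h, comp_w, 3)
```

Because each composite pixel is computed independently, the same lookup parallelizes directly in a hand-written CUDA kernel, which matches the role of the CUDA interface cited above.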

Crowd identification and tracking algorithm

A large number of examples were captured to verify our camera; one example is described in Supplementary Movie S4. For the crowded scene at the SCU East Stadium, a feature recognition algorithm [50] is introduced to locate people, and moving people are tracked in real time using the KCF algorithm [51] (see the sketch below). Although some people have their backs to the camera, the algorithm works well because the basketball players move at a suitable frequency and scale. We will continue to enrich the application scenarios and further improve the accuracy in future work. More details about crowd monitoring are presented in Appendix 4.
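As a rough, library-level sketch of this pipeline (not the exact detector of [50]), the snippet below uses OpenCV's default HOG person detector to locate people in the first frame and then follows each of them with a KCF tracker; the video path is a placeholder, and depending on the OpenCV build the KCF constructor may live under cv2 or cv2.legacy.

```python
import cv2

def create_kcf():
    """KCF tracker constructor location varies across OpenCV builds."""
    if hasattr(cv2, "TrackerKCF_create"):
        return cv2.TrackerKCF_create()
    return cv2.legacy.TrackerKCF_create()

cap = cv2.VideoCapture("stadium_example.mp4")   # placeholder video path
ok, frame = cap.read()
if not ok:
    raise SystemExit("could not open the example video")

# Detect people in the first frame (stand-in for the feature recognition of [50]).
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
boxes, _ = hog.detectMultiScale(frame, winStride=(8, 8))

# One KCF tracker per detected person.
trackers = []
for (x, y, w, h) in boxes:
    tracker = create_kcf()
    tracker.init(frame, (int(x), int(y), int(w), int(h)))
    trackers.append(tracker)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    for tracker in trackers:
        found, box = tracker.update(frame)       # track each person frame by frame
        if found:
            x, y, w, h = map(int, box)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) == 27:                     # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```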

Dataset production

Real-world datasets are captured to train our end-to-end model with 4× computational zoom. The production of high-quality datasets is a key factor affecting super-resolution reconstruction. Long-focus and short-focus images are slightly misaligned during the zoom process, and rough alignment and cropping can cause artifacts. Given a pair of short-focus and long-focus images, we regard the long-focus image as the positive sample and the short-focus image as the negative sample. The content corresponding to the positive sample can be located in the negative sample using image registration, as sketched below. Here we define the long-focus image as the ground truth (GT), and the GT image and the LR image form a data pair. More details about dataset production are presented in Appendix 1. Comparisons with traditional systems and methods are given in Appendix 5 and Supplementary Table S2.
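A minimal sketch of this pairing step is shown below, assuming OpenCV's ECC image-alignment routine as the registration tool (the paper does not specify which registration technique is used) and assuming the short-focus (LR) crop has already been roughly cropped and upscaled to the GT field of view; the LR crop is then finely warped onto the long-focus (GT) frame so that aligned patch pairs can be harvested for training.

```python
import cv2
import numpy as np

def align_pair(short_focus, long_focus):
    """Register the short-focus (LR) image to the long-focus (GT) image with ECC.

    Returns the warped LR image so that (LR, GT) form a pixel-aligned training pair.
    """
    lr_gray = cv2.cvtColor(short_focus, cv2.COLOR_BGR2GRAY)
    gt_gray = cv2.cvtColor(long_focus, cv2.COLOR_BGR2GRAY)

    warp = np.eye(2, 3, dtype=np.float32)        # affine warp, initialized to identity
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    _, warp = cv2.findTransformECC(gt_gray, lr_gray, warp,
                                   cv2.MOTION_AFFINE, criteria, None, 5)

    h, w = long_focus.shape[:2]
    aligned_lr = cv2.warpAffine(short_focus, warp, (w, h),
                                flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
    return aligned_lr

# Placeholder file names; in practice these come from the captured zoom pairs.
lr = cv2.imread("short_focus.png")
gt = cv2.imread("long_focus.png")
lr_aligned = align_pair(lr, gt)   # (lr_aligned, gt) is one LR/GT training pair
```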

Conclusion

The DLBP camera is inspired by the mantis shrimp compound eye and the zoom camera, and offers high scalability, flexibility and robustness. Compared with conventional zoom systems and array cameras, the DLBP camera has the following competitive advantages: (1) it learns optical zoom using a deep-learning method that does not depend on any additional optical components, recovering the ideal image at the required focal length; (2) it replaces optical deflection (with an invariant optical axis) in the marginal FOV of the array camera, breaking the zoom rule of array-camera imaging; (3) it covers more information, in both FOV and spatial resolution, avoids the requirement of overlapping FOVs and is insensitive to weakly textured areas, with high scalability; (4) it improves zoom responsiveness by ~100×, which is of great significance for applications requiring fast zoom.

The developed DLBP camera breaks the optical-zoom rule, with an 8-μrad IFOV and 4× computational zoom, using 6 cameras to capture snapshot 30-MP images at 30 fps. With the experimental system described in this work, the DLBP camera provides a new strategy for resolving the inherent contradiction among FOV, resolution and bandwidth.